About the Mamba paper
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models.

Operating on byte-sized tokens, transformers scale poorly, as every token must "attend" to every other token, leading to O(n²) scaling laws. As a result, Transformers opt to use subword tokenization to reduce the volume of tokens.
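Because the model inherits from PreTrainedModel, the usual transformers loading and generation APIs apply. Below is a minimal usage sketch; the checkpoint id is an assumption and can be swapped for any Mamba checkpoint available on the Hub.

```python
from transformers import AutoTokenizer, MambaForCausalLM

# Assumed checkpoint id; replace with the Mamba checkpoint you intend to use.
model_id = "state-spaces/mamba-130m-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = MambaForCausalLM.from_pretrained(model_id)

# Tokenize a prompt and generate a continuation using the generic
# PreTrainedModel / generation utilities mentioned above.
inputs = tokenizer("Mamba is a state space model that", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```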