SentencePiece is a re-implementation of sub-word units, an effective way to alleviate the open vocabulary problems in neural machine translation. SentencePiece supports two segmentation algorithms, byte-pair-encoding (BPE) and the unigram language model. Here are the high-level differences from other implementations.

**The number of unique tokens is predetermined.** Neural Machine Translation models typically operate with a fixed vocabulary. Unlike most unsupervised word segmentation algorithms, which assume an infinite vocabulary, SentencePiece trains the segmentation model such that the final vocabulary size is fixed, e.g., 8k, 16k, or 32k. Note that SentencePiece specifies the final vocabulary size for training, which is different from subword-nmt, which uses the number of merge operations. The number of merge operations is a BPE-specific parameter and is not applicable to other segmentation algorithms, including unigram, word, and character.

Previous sub-word implementations assume that the input sentences are pre-tokenized. This constraint was required for efficient training, but it makes preprocessing complicated, as we have to run language-dependent tokenizers in advance. The implementation of SentencePiece is fast enough to train the model from raw sentences.

- Fast and lightweight: segmentation speed is around 50k sentences/sec, and the memory footprint is around 6MB.
- Self-contained: the same tokenization/detokenization is obtained as long as the same model file is used.
- Direct vocabulary id generation: SentencePiece manages the vocabulary-to-id mapping and can directly generate vocabulary id sequences from raw sentences.
- NFKC-based normalization: SentencePiece performs NFKC-based text normalization.

For those unfamiliar with SentencePiece as a software/algorithm, one can read a gentle introduction here.

**Comparisons with other implementations.** Note that the BPE algorithm used in WordPiece is slightly different from the original BPE.
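The WordPiece difference mentioned above can be sketched as a difference in how the next merge is chosen: classic BPE merges the most frequent adjacent pair, while WordPiece-style selection scores a pair by the likelihood gain freq(ab) / (freq(a) · freq(b)). A minimal illustration; the toy corpus and helper function are assumptions for demonstration, not part of either library:

```python
from collections import Counter

def pair_and_symbol_counts(words):
    """Count adjacent symbol pairs and individual symbols across a corpus.

    `words` maps a whitespace-separated symbol sequence to its frequency.
    """
    pairs, symbols = Counter(), Counter()
    for word, freq in words.items():
        syms = word.split()
        for s in syms:
            symbols[s] += freq
        for a, b in zip(syms, syms[1:]):
            pairs[(a, b)] += freq
    return pairs, symbols

# Toy character-level corpus (an assumption for this sketch).
corpus = {"h u g": 10, "p u g": 5, "p u n": 12, "b u n": 4, "h u g s": 5}
pairs, symbols = pair_and_symbol_counts(corpus)

# Classic BPE: pick the most frequent adjacent pair.
bpe_pick = max(pairs, key=pairs.get)

# WordPiece-style: pick the pair maximizing freq(ab) / (freq(a) * freq(b)).
wp_pick = max(pairs, key=lambda p: pairs[p] / (symbols[p[0]] * symbols[p[1]]))

# The two criteria can disagree: BPE prefers the raw-frequency winner,
# while the likelihood score favors a rarer pair whose parts are also rare.
```
The two criteria pick different merges here: "u g" occurs most often, but "g s" has the highest likelihood score because "g" and "s" are individually infrequent.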
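The self-contained, raw-sentence design rests on treating the input as a plain character stream and escaping spaces with the "▁" (U+2581) meta symbol, so detokenization is a lossless string operation with no language-dependent rules. A minimal sketch of that round trip; the character-level encoder below is a stand-in assumption, not the real trained segmenter:

```python
# "▁" (U+2581) is the whitespace meta symbol SentencePiece actually uses;
# splitting into single characters is a toy segmentation for illustration.

def encode_raw(text):
    """Escape spaces with U+2581, then split into character pieces."""
    return list(text.replace(" ", "\u2581"))

def decode(pieces):
    """Detokenize by concatenating pieces and unescaping the meta symbol."""
    return "".join(pieces).replace("\u2581", " ")

sentence = "Hello world."
pieces = encode_raw(sentence)
restored = decode(pieces)
# The round trip is lossless: no pre-tokenizer is needed, and the same
# pure string operations reproduce the original sentence exactly.
```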