Embeddings

Bag of Words (BoW)

Supported by scikit-learn
Defaults to training from scratch

Term Frequency Inverse Document Frequency (TfIdf)

Supported by scikit-learn
Defaults to training from scratch
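
Both the BoW and TfIdf featurizers are backed by scikit-learn and trained from scratch on the given corpus. Below is a minimal sketch of selecting TfIdf, assuming the TextWiser(Embedding..., Transformation...) constructor and that scikit-learn keyword arguments such as min_df and stop_words pass through to the underlying vectorizer:

```python
from textwiser import TextWiser, Embedding, Transformation

documents = ["some document", "another document about documents"]

# TfIdf trained from scratch; keyword arguments such as min_df and stop_words
# are assumed to pass through to the underlying scikit-learn vectorizer.
emb = TextWiser(Embedding.TfIdf(min_df=1, stop_words="english"),
                Transformation.NMF(n_components=2))
vecs = emb.fit_transform(documents)
```

Swapping in Embedding.BOW (assuming that is the corresponding selector) would pick the bag-of-words counterpart instead.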

Document Embeddings (Doc2Vec)

Supported by gensim
Defaults to training from scratch
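
Doc2Vec yields one vector per document and is trained from scratch through gensim. A sketch, under the assumption that keyword arguments such as vector_size and min_count are forwarded to the gensim model:

```python
from textwiser import TextWiser, Embedding

documents = ["first document", "second document", "yet another document"]

# Trains a gensim Doc2Vec model from scratch on the given corpus;
# vector_size and min_count are assumed to be forwarded to gensim.
emb = TextWiser(Embedding.Doc2Vec(vector_size=32, min_count=1))
vecs = emb.fit_transform(documents)
```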

Universal Sentence Encoder (USE)

Supported by tensorflow; see the extra requirements for installation
Defaults to large v5
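
USE ships pretrained (large v5 by default) and needs the tensorflow extras, so no arguments are required to get started. A sketch, assuming Embedding.USE is the corresponding selector and the default pretrained model is used when no arguments are given:

```python
from textwiser import TextWiser, Embedding

# Pretrained Universal Sentence Encoder (large v5 by default); needs tensorflow.
emb = TextWiser(Embedding.USE())
vecs = emb.fit_transform(["a sentence to embed", "another sentence"])
```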

Compound Embedding

Supported by a context-free grammar
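
A compound embedding composes other embeddings according to a schema expressed in the library's context-free grammar. The schema below is only an assumed illustration of that grammar's shape, concatenating a pooled word2vec branch with a tfidf-to-nmf branch; the exact grammar and key names are defined by the library:

```python
from textwiser import TextWiser, Embedding

# Hypothetical schema for illustration only; the real grammar is defined by
# the library. Two branches are concatenated: pooled word2vec vectors and a
# tfidf representation reduced with nmf.
schema = {
    "concat": [
        {"transform": [("word2vec", {"pretrained": "en"}), "pool"]},
        {"transform": ["tfidf", ("nmf", {"n_components": 2})]},
    ]
}

emb = TextWiser(Embedding.Compound(schema=schema))
vecs = emb.fit_transform(["a small corpus", "of three documents", "for the sketch"])
```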

Word Embedding: Word2Vec

Supported by these pretrained embeddings
Common pretrained options include crawl, glove, extvec, twitter, and en-news
When the pretrained option is None, trains a new model from the given data
Defaults to en, FastText embeddings trained on news
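
Word-level embeddings return one vector per token, so they are usually followed by a pooling transformation to obtain a single document vector. A sketch using the default pretrained en model, assuming the WordOptions and PoolOptions selectors shown below:

```python
from textwiser import TextWiser, Embedding, Transformation, WordOptions, PoolOptions

# Pretrained FastText vectors ('en' by default); max-pool the token vectors
# into a single vector per document. pretrained=None would train from scratch.
emb = TextWiser(Embedding.Word(word_option=WordOptions.word2vec, pretrained="en"),
                Transformation.Pool(pool_option=PoolOptions.max))
vecs = emb.fit_transform(["some document", "another document"])
```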

Word Embedding: Character

Initialized randomly and not pretrained
Useful when trained for a downstream task
Enable fine-tuning to get good embeddings
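
Because character embeddings start from a random initialization, they mainly pay off when fine-tuned against a downstream task. A sketch, assuming the is_finetuneable flag of the TextWiser constructor together with a torch dtype is what enables training the embedding:

```python
import torch
from textwiser import TextWiser, Embedding, Transformation, WordOptions, PoolOptions

# Randomly initialized character embeddings; is_finetuneable and the torch
# dtype are assumed to be what lets gradients flow into the embedding when
# it is trained as part of a downstream model.
emb = TextWiser(Embedding.Word(word_option=WordOptions.char),
                Transformation.Pool(pool_option=PoolOptions.max),
                is_finetuneable=True, dtype=torch.float32)
vecs = emb.fit_transform(["some document", "another document"])
```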

Word Embedding: BytePair

Supported by these pretrained embeddings
Pretrained options can be specified with the string <lang>_<dim>_<vocab_size>
Default options can be omitted like en, en_100, or en__10000
Defaults to en, which is equal to en_100_10000
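
The pretrained string encodes language, dimensionality, and vocabulary size, so en_100_10000 spells out the default explicitly. A sketch, assuming WordOptions.bytepair is the matching option:

```python
from textwiser import TextWiser, Embedding, Transformation, WordOptions, PoolOptions

# 'en_100_10000' spells out the default: English, 100 dimensions, 10k vocabulary.
emb = TextWiser(Embedding.Word(word_option=WordOptions.bytepair, pretrained="en_100_10000"),
                Transformation.Pool(pool_option=PoolOptions.mean))
vecs = emb.fit_transform(["some document", "another document"])
```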

Word Embedding: ELMo

Supported by these options
Defaults to original

Word Embedding: Flair

Supported by these pretrained embeddings
Defaults to news-forward-fast

Word Embedding: BERT

Supported by these pretrained embeddings
Defaults to bert-base-uncased

Word Embedding: OpenAI GPT

Supported by these pretrained embeddings
Defaults to openai-gpt

Word Embedding: OpenAI GPT2

Supported by these pretrained embeddings
Defaults to gpt2-medium

Word Embedding: TransformerXL

Supported by these pretrained embeddings
Defaults to transfo-xl-wt103

Word Embedding: XLNet

Supported by these pretrained embeddings
Defaults to xlnet-large-cased

Word Embedding: XLM

Supported by these pretrained embeddings
Defaults to xlm-mlm-en-2048

Word Embedding: RoBERTa

Supported by these pretrained embeddings
Defaults to roberta-base

Word Embedding: DistilBERT

Supported by these pretrained embeddings
Defaults to distilbert-base-uncased

Word Embedding: CTRL

Supported by these pretrained embeddings
Defaults to ctrl

Word Embedding: ALBERT

Supported by these pretrained embeddings
Defaults to albert-base-v2

Word Embedding: T5

Supported by these pretrained embeddings
Defaults to t5-base

Word Embedding: XLM-RoBERTa

Supported by these pretrained embeddings
Defaults to xlm-roberta-base

Word Embedding: BART

Supported by these pretrained embeddings
Defaults to facebook/bart-base

Word Embedding: ELECTRA

Supported by these pretrained embeddings
Defaults to google/electra-base-generator

Word Embedding: DialoGPT

Supported by these pretrained embeddings
Defaults to microsoft/DialoGPT-small

Word Embedding: Longformer

Supported by these pretrained embeddings
Defaults to allenai/longformer-base-4096
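
All of the contextual word embeddings above, from ELMo and Flair through the transformer models ending with Longformer, are selected the same way: choose the matching WordOptions member and optionally override the listed default with another pretrained model name. A sketch with BERT, assuming WordOptions.bert is that member and that Hugging Face model names are accepted for pretrained:

```python
from textwiser import TextWiser, Embedding, Transformation, WordOptions, PoolOptions

# Contextual token embeddings from BERT ('bert-base-uncased' by default),
# mean-pooled into one vector per document. The other contextual options are
# used the same way with their own WordOptions member and default model name.
emb = TextWiser(Embedding.Word(word_option=WordOptions.bert, pretrained="bert-base-uncased"),
                Transformation.Pool(pool_option=PoolOptions.mean))
vecs = emb.fit_transform(["some document", "another document"])
```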

Tokenization

In general, text data should be whitespace-tokenized before being fed into TextWiser.

  • The BOW, Doc2Vec, TfIdf and Word embeddings also accept an optional tokenizer parameter (see the sketch after this list).

  • The BOW and TfIdf embeddings expose all the functionality of the underlying scikit-learn models, so it is also possible to specify other text preprocessing options such as stop_words.

  • Tokenization for Doc2Vec and Word splits on whitespace by default. The Word embedding only uses the tokenizer parameter when its word_option is set to WordOptions.word2vec, and raises an error otherwise.
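
For the embeddings that accept it, the tokenizer parameter is a callable mapping a document string to a list of tokens. A sketch assuming the callable can be passed directly to the embedding (here TfIdf; for Word it would only apply with WordOptions.word2vec, as noted above), with a hypothetical lowercase_tokenizer helper:

```python
from textwiser import TextWiser, Embedding, Transformation

def lowercase_tokenizer(doc):
    # Hypothetical helper: lowercase the document and split on whitespace.
    return doc.lower().split()

# The tokenizer callable is assumed to be forwarded to the underlying
# scikit-learn vectorizer for the TfIdf embedding.
emb = TextWiser(Embedding.TfIdf(tokenizer=lowercase_tokenizer, min_df=1),
                Transformation.NMF(n_components=2))
vecs = emb.fit_transform(["Some Document", "Another Document about documents"])
```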