Embeddings

Bag of Words (BoW)

Supported by scikit-learn
Defaults to training from scratch

Term Frequency Inverse Document Frequency (TfIdf)

Supported by scikit-learn
Defaults to training from scratch
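
Both the BoW and TfIdf featurizers are backed by scikit-learn and trained from scratch on the given corpus. Below is a minimal sketch of selecting TfIdf, assuming the TextWiser(Embedding..., Transformation...) constructor and that scikit-learn keyword arguments such as min_df and stop_words pass through to the underlying vectorizer:

```python
from textwiser import TextWiser, Embedding, Transformation

documents = ["some document", "another document about documents"]

# TfIdf trained from scratch; keyword arguments such as min_df and stop_words
# are assumed to pass through to the underlying scikit-learn vectorizer.
emb = TextWiser(Embedding.TfIdf(min_df=1, stop_words="english"),
                Transformation.NMF(n_components=2))
vecs = emb.fit_transform(documents)
```

Swapping in Embedding.BOW (assuming that is the corresponding selector) would pick the bag-of-words counterpart instead.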

Document Embeddings (Doc2Vec)

Supported by gensim
Defaults to training from scratch
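
Doc2Vec yields one vector per document and is trained from scratch through gensim. A sketch, under the assumption that keyword arguments such as vector_size and min_count are forwarded to the gensim model:

```python
from textwiser import TextWiser, Embedding

documents = ["first document", "second document", "yet another document"]

# Trains a gensim Doc2Vec model from scratch on the given corpus;
# vector_size and min_count are assumed to be forwarded to gensim.
emb = TextWiser(Embedding.Doc2Vec(vector_size=32, min_count=1))
vecs = emb.fit_transform(documents)
```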

Universal Sentence Encoder (USE)

Supported by tensorflow; see the extra requirements for installation
Defaults to large v5
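
USE ships pretrained (large v5 by default) and needs the tensorflow extras, so no arguments are required to get started. A sketch, assuming Embedding.USE is the corresponding selector and the default pretrained model is used when no arguments are given:

```python
from textwiser import TextWiser, Embedding

# Pretrained Universal Sentence Encoder (large v5 by default); needs tensorflow.
emb = TextWiser(Embedding.USE())
vecs = emb.fit_transform(["a sentence to embed", "another sentence"])
```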

Compound Embedding

Supported by a context-free grammar
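
A compound embedding composes other embeddings according to a schema expressed in the library's context-free grammar. The schema below is only an assumed illustration of that grammar's shape, concatenating a pooled word2vec branch with a tfidf-to-nmf branch; the exact grammar and key names are defined by the library:

```python
from textwiser import TextWiser, Embedding

# Hypothetical schema for illustration only; the real grammar is defined by
# the library. Two branches are concatenated: pooled word2vec vectors and a
# tfidf representation reduced with nmf.
schema = {
    "concat": [
        {"transform": [("word2vec", {"pretrained": "en"}), "pool"]},
        {"transform": ["tfidf", ("nmf", {"n_components": 2})]},
    ]
}

emb = TextWiser(Embedding.Compound(schema=schema))
vecs = emb.fit_transform(["a small corpus", "of three documents", "for the sketch"])
```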

Word Embedding: Word2Vec

Supported by these pretrained embeddings
Common pretrained options include crawl, glove, extvec, twitter, and en-news
When the pretrained option is None, trains a new model from the given data
Defaults to en, FastText embeddings trained on news
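
Word-level embeddings return one vector per token, so they are usually followed by a pooling transformation to obtain a single document vector. A sketch using the default pretrained en model, assuming the WordOptions and PoolOptions selectors shown below:

```python
from textwiser import TextWiser, Embedding, Transformation, WordOptions, PoolOptions

# Pretrained FastText vectors ('en' by default); max-pool the token vectors
# into a single vector per document. pretrained=None would train from scratch.
emb = TextWiser(Embedding.Word(word_option=WordOptions.word2vec, pretrained="en"),
                Transformation.Pool(pool_option=PoolOptions.max))
vecs = emb.fit_transform(["some document", "another document"])
```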

Word Embedding: Character

Initialized randomly and not pretrained
Useful when trained for a downstream task
Enable fine-tuning to get good embeddings
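
Because character embeddings start from a random initialization, they mainly pay off when fine-tuned against a downstream task. A sketch, assuming the is_finetuneable flag of the TextWiser constructor together with a torch dtype is what enables training the embedding:

```python
import torch
from textwiser import TextWiser, Embedding, Transformation, WordOptions, PoolOptions

# Randomly initialized character embeddings; is_finetuneable and the torch
# dtype are assumed to be what lets gradients flow into the embedding when
# it is trained as part of a downstream model.
emb = TextWiser(Embedding.Word(word_option=WordOptions.char),
                Transformation.Pool(pool_option=PoolOptions.max),
                is_finetuneable=True, dtype=torch.float32)
vecs = emb.fit_transform(["some document", "another document"])
```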

Word Embedding: BytePair

Supported by these pretrained embeddings
Pretrained options can be specified with the string <lang>_<dim>_<vocab_size>
Default options can be omitted like en, en_100, or en__10000
Defaults to en, which is equal to en_100_10000
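
The pretrained string encodes language, dimensionality, and vocabulary size, so en_100_10000 spells out the default explicitly. A sketch, assuming WordOptions.bytepair is the matching option:

```python
from textwiser import TextWiser, Embedding, Transformation, WordOptions, PoolOptions

# 'en_100_10000' spells out the default: English, 100 dimensions, 10k vocabulary.
emb = TextWiser(Embedding.Word(word_option=WordOptions.bytepair, pretrained="en_100_10000"),
                Transformation.Pool(pool_option=PoolOptions.mean))
vecs = emb.fit_transform(["some document", "another document"])
```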

Word Embedding: ELMo

Supported by these options
Defaults to original

Word Embedding: Flair

Supported by these pretrained embeddings
Defaults to news-forward-fast

Word Embedding: BERT

Supported by these pretrained embeddings
Defaults to bert-base-uncased

Word Embedding: OpenAI GPT

Supported by these pretrained embeddings
Defaults to openai-gpt

Word Embedding: OpenAI GPT2

Supported by these pretrained embeddings
Defaults to gpt2-medium

Word Embedding: TransformerXL

Supported by these pretrained embeddings
Defaults to transfo-xl-wt103

Word Embedding: XLNet

Supported by these pretrained embeddings
Defaults to xlnet-large-cased

Word Embedding: XLM

Supported by these pretrained embeddings
Defaults to xlm-mlm-en-2048

Word Embedding: RoBERTa

Supported by these pretrained embeddings
Defaults to roberta-base

Word Embedding: DistilBERT

Supported by these pretrained embeddings
Defaults to distilbert-base-uncased

Word Embedding: CTRL

Supported by these pretrained embeddings
Defaults to ctrl

Word Embedding: ALBERT

Supported by these pretrained embeddings
Defaults to albert-base-v2

Word Embedding: T5

Supported by these pretrained embeddings
Defaults to t5-base

Word Embedding: XLM-RoBERTa

Supported by these pretrained embeddings
Defaults to xlm-roberta-base

Word Embedding: BART

Supported by these pretrained embeddings
Defaults to facebook/bart-base

Word Embedding: ELECTRA

Supported by these pretrained embeddings
Defaults to google/electra-base-generator

Word Embedding: DialoGPT

Supported by these pretrained embeddings
Defaults to microsoft/DialoGPT-small

Word Embedding: Longformer

Supported by these pretrained embeddings
Defaults to allenai/longformer-base-4096
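
All of the contextual word embeddings above, from ELMo and Flair through the transformer models ending with Longformer, are selected the same way: choose the matching WordOptions member and optionally override the listed default with another pretrained model name. A sketch with BERT, assuming WordOptions.bert is that member and that Hugging Face model names are accepted for pretrained:

```python
from textwiser import TextWiser, Embedding, Transformation, WordOptions, PoolOptions

# Contextual token embeddings from BERT ('bert-base-uncased' by default),
# mean-pooled into one vector per document. The other contextual options are
# used the same way with their own WordOptions member and default model name.
emb = TextWiser(Embedding.Word(word_option=WordOptions.bert, pretrained="bert-base-uncased"),
                Transformation.Pool(pool_option=PoolOptions.mean))
vecs = emb.fit_transform(["some document", "another document"])
```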

Tokenization

In general, text data should be whitespace-tokenized before being fed into TextWiser.

  • The BOW, Doc2Vec, TfIdf and Word embeddings also accept an optional tokenizer parameter (see the sketch after this list).

  • The BOW and TfIdf embeddings expose all the functionality of the underlying scikit-learn models, so it is also possible to specify other text preprocessing options such as stop_words.

  • Tokenization for Doc2Vec and Word splits on whitespace by default. The Word embedding only uses the tokenizer parameter when its word_option is set to WordOptions.word2vec, and raises an error otherwise.
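
For the embeddings that accept it, the tokenizer parameter is a callable mapping a document string to a list of tokens. A sketch assuming the callable can be passed directly to the embedding (here TfIdf; for Word it would only apply with WordOptions.word2vec, as noted above), with a hypothetical lowercase_tokenizer helper:

```python
from textwiser import TextWiser, Embedding, Transformation

def lowercase_tokenizer(doc):
    # Hypothetical helper: lowercase the document and split on whitespace.
    return doc.lower().split()

# The tokenizer callable is assumed to be forwarded to the underlying
# scikit-learn vectorizer for the TfIdf embedding.
emb = TextWiser(Embedding.TfIdf(tokenizer=lowercase_tokenizer, min_df=1),
                Transformation.NMF(n_components=2))
vecs = emb.fit_transform(["Some Document", "Another Document about documents"])
```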