Embeddings

Embeddings | Notes |
---|---|
Bag of Words (BoW) | Supported by `scikit-learn` <br> Defaults to training from scratch |
Term Frequency Inverse Document Frequency (TfIdf) | Supported by `scikit-learn` <br> Defaults to training from scratch |
Document Embeddings (Doc2Vec) | Supported by `gensim` <br> Defaults to training from scratch |
Universal Sentence Encoder (USE) | Supported by `tensorflow`, see requirements <br> Defaults to large v5 |
Compound Embedding | Supported by a context-free grammar |
Word Embedding: Word2Vec | Supported by these pretrained embeddings <br> Common pretrained options include `crawl`, `glove`, `extvec`, `twitter`, and `en-news` <br> When the pretrained option is `None`, trains a new model from the given data <br> Defaults to `en`, FastText embeddings trained on news |
Word Embedding: Character | Initialized randomly and not pretrained <br> Useful when trained for a downstream task <br> Enable fine-tuning to get good embeddings |
Word Embedding: BytePair | Supported by these pretrained embeddings <br> Pretrained options can be specified with the string `<lang>_<dim>_<vocab_size>` <br> Default options can be omitted like `en`, `en_100`, or `en__10000` <br> Defaults to `en`, which is equal to `en_100_10000` |
Word Embedding: ELMo | Supported by these options <br> Defaults to `original` |
Word Embedding: Flair | Supported by these pretrained embeddings <br> Defaults to `news-forward-fast` |
Word Embedding: BERT | Supported by these pretrained embeddings <br> Defaults to `bert-base-uncased` |
Word Embedding: OpenAI GPT | Supported by these pretrained embeddings <br> Defaults to `openai-gpt` |
Word Embedding: OpenAI GPT2 | Supported by these pretrained embeddings <br> Defaults to `gpt2-medium` |
Word Embedding: TransformerXL | Supported by these pretrained embeddings <br> Defaults to `transfo-xl-wt103` |
Word Embedding: XLNet | Supported by these pretrained embeddings <br> Defaults to `xlnet-large-cased` |
Word Embedding: XLM | Supported by these pretrained embeddings <br> Defaults to `xlm-mlm-en-2048` |
Word Embedding: RoBERTa | Supported by these pretrained embeddings <br> Defaults to `roberta-base` |
Word Embedding: DistilBERT | Supported by these pretrained embeddings <br> Defaults to `distilbert-base-uncased` |
Word Embedding: CTRL | Supported by these pretrained embeddings <br> Defaults to `ctrl` |
Word Embedding: ALBERT | Supported by these pretrained embeddings <br> Defaults to `albert-base-v2` |
Word Embedding: T5 | Supported by these pretrained embeddings <br> Defaults to `t5-base` |
Word Embedding: XLM-RoBERTa | Supported by these pretrained embeddings <br> Defaults to `xlm-roberta-base` |
Word Embedding: BART | Supported by these pretrained embeddings <br> Defaults to `facebook/bart-base` |
Word Embedding: ELECTRA | Supported by these pretrained embeddings <br> Defaults to `google/electra-base-generator` |
Word Embedding: DialoGPT | Supported by these pretrained embeddings <br> Defaults to `microsoft/DialoGPT-small` |
Word Embedding: Longformer | Supported by these pretrained embeddings <br> Defaults to `allenai/longformer-base-4096` |
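
To make the table above concrete, here is a minimal sketch of how one of these embeddings might be selected in TextWiser. It assumes the `TextWiser` / `Embedding` / `Transformation` factory pattern and treats the specific parameter and enum names (`min_df`, `n_components`, `word_option`, `pretrained`, `pool_option`) as illustrative assumptions rather than a definitive API reference.

```python
from textwiser import TextWiser, Embedding, Transformation, WordOptions, PoolOptions

documents = ["A first document about embeddings.",
             "A second, longer document about pretrained word vectors."]

# TfIdf row of the table: backed by scikit-learn and trained from scratch on `documents`.
# A Transformation (here NMF) can optionally be chained to reduce the dimensionality.
tfidf_emb = TextWiser(Embedding.TfIdf(min_df=1), Transformation.NMF(n_components=2))
tfidf_vecs = tfidf_emb.fit_transform(documents)

# Word2Vec row of the table: uses the default `en` pretrained FastText vectors and
# pools the word vectors into a single document vector (names assumed for illustration).
w2v_emb = TextWiser(Embedding.Word(word_option=WordOptions.word2vec, pretrained='en'),
                    Transformation.Pool(pool_option=PoolOptions.mean))
w2v_vecs = w2v_emb.fit_transform(documents)
```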
Tokenization
In general, text data should be whitespace-tokenized before being fed into TextWiser.
The `BOW`, `Doc2Vec`, `TfIdf`, and `Word` embeddings also accept an optional `tokenizer` parameter. The `BOW` and `TfIdf` embeddings expose all the functionality of the underlying scikit-learn models, so it is also possible to specify other text preprocessing options such as `stop_words`. Tokenization for `Doc2Vec` and `Word` splits on whitespace. The latter model only uses the `tokenizer` parameter if the `word_options` parameter is set to `WordOptions.word2vec`, and will raise an error otherwise.
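
As a rough sketch of the `tokenizer` parameter described above, the snippet below passes a custom lowercasing whitespace tokenizer, together with a `stop_words` option, to the `TfIdf` embedding. It assumes these keyword arguments are forwarded to the underlying scikit-learn vectorizer as described; the function name and exact call are illustrative.

```python
from textwiser import TextWiser, Embedding, Transformation

def lowercase_tokenizer(doc):
    """Lowercase the document and split it on whitespace."""
    return doc.lower().split()

# tokenizer and stop_words are assumed to pass through to the scikit-learn model,
# as noted in the Tokenization section above.
emb = TextWiser(Embedding.TfIdf(tokenizer=lowercase_tokenizer, stop_words='english'),
                Transformation.NMF(n_components=2))
vecs = emb.fit_transform(["A first document about tokenization.",
                          "A second, longer document with more words in it."])
```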