.. _embeddings:

Embeddings
============

.. csv-table::
    :header: "Embeddings", "Notes"

    "`Bag of Words (BoW) `_", "| Supported by ``scikit-learn``
    | Defaults to training from scratch"
    "`Term Frequency Inverse Document Frequency (TfIdf) `_", "| Supported by ``scikit-learn``
    | Defaults to training from scratch"
    "`Document Embeddings (Doc2Vec) `_", "| Supported by ``gensim``
    | Defaults to training from scratch"
    "`Universal Sentence Encoder (USE) `_", "| Supported by ``tensorflow``, see :ref:`requirements`
    | Defaults to `large v5 `_"
    ":ref:`compound`", "| Supported by a :ref:`context-free grammar`"
    "Word Embedding: `Word2Vec `_", "| Supported by these `pretrained embeddings `_
    | Common pretrained options include ``crawl``, ``glove``, ``extvec``, ``twitter``, and ``en-news``
    | When the pretrained option is ``None``, trains a new model from the given data
    | Defaults to ``en``, FastText embeddings trained on news"
    "Word Embedding: `Character `_", "| Initialized randomly and not pretrained
    | Useful when trained for a downstream task
    | Enable :ref:`fine-tuning` to get good embeddings"
    "Word Embedding: `BytePair `_", "| Supported by these `pretrained embeddings `_
    | Pretrained options can be specified with the string ``<lang>_<dim>_<vocab_size>``
    | Default options can be omitted, as in ``en``, ``en_100``, or ``en__10000``
    | Defaults to ``en``, which is equal to ``en_100_10000``"
    "Word Embedding: `ELMo `_", "| Supported by these `options `_
    | Defaults to ``original``"
    "Word Embedding: `Flair `_", "| Supported by these `pretrained embeddings `_
    | Defaults to ``news-forward-fast``"
    "Word Embedding: `BERT `_", "| Supported by these `pretrained embeddings `_
    | Defaults to ``bert-base-uncased``"
    "Word Embedding: `OpenAI GPT `_", "| Supported by these `pretrained embeddings `_
    | Defaults to ``openai-gpt``"
    "Word Embedding: `OpenAI GPT2 `_", "| Supported by these `pretrained embeddings `_
    | Defaults to ``gpt2-medium``"
    "Word Embedding: `TransformerXL `_", "| Supported by these `pretrained embeddings `_
    | Defaults to ``transfo-xl-wt103``"
    "Word Embedding: `XLNet `_", "| Supported by these `pretrained embeddings `_
    | Defaults to ``xlnet-large-cased``"
    "Word Embedding: `XLM `_", "| Supported by these `pretrained embeddings `_
    | Defaults to ``xlm-mlm-en-2048``"
    "Word Embedding: `RoBERTa `_", "| Supported by these `pretrained embeddings `_
    | Defaults to ``roberta-base``"
    "Word Embedding: `DistilBERT `_", "| Supported by these `pretrained embeddings `_
    | Defaults to ``distilbert-base-uncased``"
    "Word Embedding: `CTRL `_", "| Supported by these `pretrained embeddings `_
    | Defaults to ``ctrl``"
    "Word Embedding: `ALBERT `_", "| Supported by these `pretrained embeddings `_
    | Defaults to ``albert-base-v2``"
    "Word Embedding: `T5 `_", "| Supported by these `pretrained embeddings `_
    | Defaults to ``t5-base``"
    "Word Embedding: `XLM-RoBERTa `_", "| Supported by these `pretrained embeddings `_
    | Defaults to ``xlm-roberta-base``"
    "Word Embedding: `BART `_", "| Supported by these `pretrained embeddings `_
    | Defaults to ``facebook/bart-base``"
    "Word Embedding: `ELECTRA `_", "| Supported by these `pretrained embeddings `_
    | Defaults to ``google/electra-base-generator``"
    "Word Embedding: `DialoGPT `_", "| Supported by these `pretrained embeddings `_
    | Defaults to ``microsoft/DialoGPT-small``"
    "Word Embedding: `Longformer `_", "| Supported by these `pretrained embeddings `_
    | Defaults to ``allenai/longformer-base-4096``"

Tokenization
^^^^^^^^^^^^

In general, text data should be **whitespace-tokenized** before being fed into TextWiser.
* The ``BOW``, ``Doc2Vec``, ``TfIdf`` and ``Word`` embeddings also accept an optional ``tokenizer`` parameter, as shown in the sketch after this list.
* The ``BOW`` and ``TfIdf`` embeddings expose all the functionality of the underlying ``scikit-learn`` models, so it is also possible to specify other text preprocessing options such as ``stop_words``.
* By default, tokenization for ``Doc2Vec`` and ``Word`` splits on whitespace. The latter model only uses the ``tokenizer`` parameter if ``word_option`` is set to ``WordOptions.word2vec``, and raises an error otherwise.
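
As a concrete illustration, the sketch below passes a custom tokenizer to both a ``TfIdf`` and a ``Word`` embedding. It is a minimal sketch assuming the ``TextWiser``, ``Embedding``, ``Transformation``, ``WordOptions``, and ``PoolOptions`` API used elsewhere in this documentation; the ``whitespace_lower`` helper is a hypothetical example tokenizer.

.. code-block:: python

    from textwiser import TextWiser, Embedding, Transformation, WordOptions, PoolOptions

    def whitespace_lower(doc):
        # Hypothetical tokenizer: lowercase the document, then split on whitespace.
        return doc.lower().split()

    documents = ["Some document to embed.", "Another document!"]

    # TfIdf exposes the underlying scikit-learn options, so ``tokenizer``
    # (and, for example, ``stop_words``) is passed straight through.
    tfidf = TextWiser(Embedding.TfIdf(tokenizer=whitespace_lower))
    tfidf_vecs = tfidf.fit_transform(documents)

    # Word embeddings only honor ``tokenizer`` when word_option is
    # WordOptions.word2vec; pooling then collapses the word vectors
    # into a single vector per document.
    word = TextWiser(
        Embedding.Word(word_option=WordOptions.word2vec, tokenizer=whitespace_lower),
        Transformation.Pool(pool_option=PoolOptions.max),
    )
    word_vecs = word.fit_transform(documents)

Note that the ``Word`` example relies on the default pretrained ``en`` embeddings; per the table above, setting the pretrained option to ``None`` trains a new model from the given data instead.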