Compound Embedding
A unique research contribution of TextWiser lies in its novel approach to creating embeddings from components, called the Compound Embedding. This approach makes it possible to form arbitrarily complex embeddings, thanks to a context-free grammar that defines a formal language for valid text featurizations.
The compound embedding is instantiated using a schema built from two main operations:
* Transform Operation: defines a list of operations. The first of these operations must be an Embedding, while the rest must be Transformation(s). The idea is that Embeddings have access to raw text and turn it into vectors, so the Transformations that follow need to operate on vectors. In PyTorch terms, this is equivalent to using nn.Sequential.
* Concatenation Operation: defines a concatenation of multiple embedding vectors. This can be done at both the word and the sentence level. In PyTorch terms, this is equivalent to using torch.cat.
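To make these two operations concrete, here is a minimal sketch in schema form; the tfidf/nmf/doc2vec choices and the n_components value are illustrative, not required defaults.

from textwiser import TextWiser, Embedding

# Transform operation: a tfidf Embedding followed by an nmf Transformation,
# applied in sequence like nn.Sequential.
transform_schema = {"transform": ["tfidf", ["nmf", {"n_components": 30}]]}

# Concatenation operation: concatenates the transformed embedding above with
# a doc2vec embedding, analogous to torch.cat over the resulting vectors.
concat_schema = {"concat": [transform_schema, "doc2vec"]}

emb = TextWiser(Embedding.Compound(schema=concat_schema))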
More formally, the compound schemas are defined by the following context-free grammar:
A Context-Free Grammar of Embeddings
start → embed_like | merge
embed_like → embed_option | "[" embed_option "," dict "]"
embed_option → BOW | DOC2VEC | TFIDF | USE
merge → "{" TRANSFORM ":" "[" start "," transform_list "]" "}"
| "{" TRANSFORM ":" "[" word_like "," pool_transform_list "]" "}"
| "{" CONCAT ":" "[" concat_list "]" "}"
transform_list → transform_like | transform_like "," transform_list
transform_like → transform_option | "[" transform_option "," dict "]"
transform_option → LDA | NMF | SVD | UMAP
word_like → WORD
| "[" WORD "," dict "]"
| word_option
| "[" word_option "," dict "]"
word_option → FLAIR | CHAR | WORD2VEC | ELMO | BERT | GPT | GPT2 | TRANSFORMERXL | XLNET | XLM | ROBERTA | DISTILBERT | CTRL | ALBERT | T5 | XLM_ROBERTA | BART | ELECTRA | DIALO_GPT | LONGFORMER
pool_transform_list → pool_like
| pool_like "," transform_list
| transform_list "," pool_like
| transform_list "," pool_like "," transform_list
pool_like → POOL | "[" POOL "," dict "]"
concat_list → start | start "," concat_list
TRANSFORM → "transform"
CONCAT → "concat"
BOW → "bow"
DOC2VEC → "doc2vec"
TFIDF → "tfidf"
USE → "use"
WORD → "word"
FLAIR → "flair"
CHAR → "char"
WORD2VEC → "word2vec"
ELMO → "elmo"
BERT → "bert"
GPT → "gpt"
GPT2 → "gpt2"
TRANSFORMERXL → "transformerXL"
XLNET → "xlnet"
XLM → "xlm"
ROBERTA → "roberta"
DISTILBERT → "distilbert"
CTRL → "ctrl"
ALBERT → "albert"
T5 → "t5"
XLM_ROBERTA → "xlm_roberta"
BART → "bart"
ELECTRA → "electra"
DIALO_GPT → "dialo_gpt"
LONGFORMER → "longformer"
LDA → "lda"
NMF → "nmf"
POOL → "pool"
SVD → "svd"
UMAP → "umap"
This grammar captures the universe of valid configurations for embeddings that can be specified in TextWiser.
Note that the dict non-terminal denotes a valid JSON dictionary of parameters; its definition is omitted here for brevity.
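As a quick sanity check against the grammar, the comments below trace the derivation of a simple transform schema from the start symbol:

# Derivation of {"transform": ["tfidf", "svd"]}:
#   start          → merge
#   merge          → "{" TRANSFORM ":" "[" start "," transform_list "]" "}"
#   start          → embed_like → embed_option → TFIDF
#   transform_list → transform_like → transform_option → SVD
schema = {"transform": ["tfidf", "svd"]}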
Example Compound Schema
Consider a compound embedding that achieves the following:
* creates a word2vec embedding, which is then max pooled to the document level
* creates a flair embedding, which is then mean pooled to the document level
* creates a tfidf embedding, with its dimensionality reduced using nmf
* concatenates these three embeddings
* decomposes the concatenation via svd
The following schema captures this embedding exactly:
from textwiser import TextWiser, Embedding

example_schema = {
    "transform": [
        {
            "concat": [
                {
                    # word2vec embedding, max pooled (the default) to document level
                    "transform": [
                        ["word2vec", {"pretrained": "en"}],
                        "pool"
                    ]
                },
                {
                    # flair embedding, mean pooled to document level
                    "transform": [
                        ["flair", {"pretrained": "news-forward-fast"}],
                        ["pool", {"pool_option": "mean"}]
                    ]
                },
                {
                    # tfidf embedding, reduced to 30 dimensions with nmf
                    "transform": [
                        "tfidf",
                        ["nmf", {"n_components": 30}]
                    ]
                }
            ]
        },
        # decompose the concatenated vectors with svd
        "svd"
    ]
}

# Model: Compound
emb = TextWiser(Embedding.Compound(schema=example_schema))
See the usage example for a runnable notebook demonstrating compound embeddings.
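As a quick usage sketch, the fitted model turns raw documents into a single feature matrix; the documents below are illustrative, and the pretrained word2vec and flair models are assumed to be downloadable in your environment.

from textwiser import TextWiser, Embedding

documents = ["Some document text.", "Another document to featurize."]

emb = TextWiser(Embedding.Compound(schema=example_schema))
vecs = emb.fit_transform(documents)  # shape: (num_documents, embedding_dim)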