Compound Embedding
A unique research contribution of TextWiser lies in its novel approach to building embeddings from components, called the Compound Embedding. This method makes it possible to form arbitrarily complex embeddings, thanks to a context-free grammar that defines a formal language for valid text featurization.
The compound embedding is instantiated using a schema built from two main production rules:
Transform Operation: This operator defines a list of operations. The first of these operations should be an Embedding, while the rest should be Transformation(s). The idea is that Embeddings have access to raw text and turn it into vectors, so the following Transformations need to operate on vectors. In PyTorch terms, this is equivalent to using nn.Sequential.

Concatenation Operation: This operator defines a concatenation of multiple embedding vectors. This can be done at both the word and the sentence level. In PyTorch terms, this is equivalent to using torch.cat.

A minimal sketch of both operators follows this list.
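To make the two rules concrete, here is a minimal sketch showing one schema per operator. The option names (tfidf, doc2vec, nmf, svd) come from the grammar below; the parameter value is illustrative.

# A minimal sketch of the two production rules as Python schemas.

# Transform: an Embedding ("tfidf") followed by Transformations ("nmf", then
# "svd"), applied in order like nn.Sequential.
transform_schema = {"transform": ["tfidf", ["nmf", {"n_components": 30}], "svd"]}

# Concat: merge multiple embedding vectors along the feature dimension,
# like torch.cat.
concat_schema = {"concat": ["tfidf", "doc2vec"]}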
More formally, the compound schemas are defined by the following context-free grammar:
A Context-Free Grammar of Embeddings
start → embed_like | merge
embed_like → embed_option | "[" embed_option "," dict "]"
embed_option → BOW | DOC2VEC | TFIDF | USE
merge → "{" TRANSFORM ":" "[" start "," transform_list "]" "}"
| "{" TRANSFORM ":" "[" word_like "," pool_transform_list "]" "}"
| "{" CONCAT ":" "[" concat_list "]" "}"
transform_list → transform_like | transform_like "," transform_list
transform_like → transform_option | "[" transform_option "," dict "]"
transform_option → LDA | NMF | SVD | UMAP
word_like → WORD
| "[" WORD "," dict "]"
| word_option
| "[" word_option "," dict "]"
word_option → FLAIR | CHAR | WORD2VEC | ELMO | BERT | GPT | GPT2 | TRANSFORMERXL | XLNET | XLM | ROBERTA | DISTILBERT | CTRL | ALBERT | T5 | XLM_ROBERTA | BART | ELECTRA | DIALO_GPT | LONGFORMER
pool_transform_list → pool_like
| pool_like "," transform_list
| transform_list "," pool_like
| transform_list "," pool_like "," transform_list
pool_like → POOL | "[" POOL "," dict "]"
concat_list → start | start "," concat_list
TRANSFORM → "transform"
CONCAT → "concat"
BOW → "bow"
DOC2VEC → "doc2vec"
TFIDF → "tfidf"
USE → "use"
WORD → "word"
FLAIR → "flair"
CHAR → "char"
WORD2VEC → "word2vec"
ELMO → "elmo"
BERT → "bert"
GPT → "gpt"
GPT2 → "gpt2"
TRANSFORMERXL → "transformerXL"
XLNET → "xlnet"
XLM → "xlm"
ROBERTA → "roberta"
DISTILBERT → "distilbert"
CTRL → "ctrl"
ALBERT → "albert"
T5 → "t5"
XLM_ROBERTA → "xlm_roberta"
BART → "bart"
ELECTRA → "electra"
DIALO_GPT → "dialo_gpt"
LONGFORMER → "longformer"
LDA → "lda"
NMF → "nmf"
POOL → "pool"
SVD → "svd"
UMAP → "umap"
This grammar captures the universe of valid configurations for embeddings that can be specified in TextWiser. Note that the dict non-terminal denotes a valid JSON dictionary but is left outside this definition for the sake of brevity. A sample implementation of dict can be found here.
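To make the grammar concrete, the following is a minimal structural validator, sketched here for illustration and not part of TextWiser. It checks the shape of a schema against the productions above, with one simplification: the real grammar only allows pool after word-level embeddings, while this check accepts it anywhere in the transformation list.

# A minimal sketch (not part of TextWiser) of a structural validator for
# compound schemas. Simplification: "pool" is accepted anywhere in the
# transformation list, whereas the grammar only allows it after word_like.

EMBED_OPTIONS = {"bow", "doc2vec", "tfidf", "use"}
WORD_OPTIONS = {"word", "flair", "char", "word2vec", "elmo", "bert", "gpt",
                "gpt2", "transformerXL", "xlnet", "xlm", "roberta",
                "distilbert", "ctrl", "albert", "t5", "xlm_roberta", "bart",
                "electra", "dialo_gpt", "longformer"}
TRANSFORM_OPTIONS = {"lda", "nmf", "svd", "umap", "pool"}

def _is_option(node, options):
    # embed_like / transform_like / pool_like: either "name" or ["name", dict]
    if isinstance(node, str):
        return node in options
    return (isinstance(node, list) and len(node) == 2
            and isinstance(node[0], str) and node[0] in options
            and isinstance(node[1], dict))

def is_valid(schema):
    # start -> embed_like | merge
    if _is_option(schema, EMBED_OPTIONS):
        return True
    if not (isinstance(schema, dict) and len(schema) == 1):
        return False
    key, value = next(iter(schema.items()))
    if key == "concat":
        # concat_list: one or more valid sub-schemas
        return isinstance(value, list) and len(value) >= 1 and all(map(is_valid, value))
    if key == "transform":
        # transform: a start or word_like head, then one or more transformations
        if not (isinstance(value, list) and len(value) >= 2):
            return False
        head, tail = value[0], value[1:]
        head_ok = is_valid(head) or _is_option(head, WORD_OPTIONS)
        return head_ok and all(_is_option(t, TRANSFORM_OPTIONS) for t in tail)
    return False

Running is_valid on the example schema in the next section returns True.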
Example Compound Schema
Consider a compound embedding that achieves the following:
* creates a word2vec embedding, which is then max-pooled to the document level
* creates a flair embedding, which is then mean-pooled to the document level
* creates a tfidf embedding, reduced in dimensionality using nmf
* concatenates these three embeddings together
* decomposes the concatenation via svd

The example schema below captures exactly this embedding:
from textwiser import TextWiser, Embedding

example_schema = {
    "transform": [
        {
            "concat": [
                {
                    "transform": [
                        ["word2vec", {"pretrained": "en"}],
                        "pool"
                    ]
                },
                {
                    "transform": [
                        ["flair", {"pretrained": "news-forward-fast"}],
                        ["pool", {"pool_option": "mean"}]
                    ]
                },
                {
                    "transform": [
                        "tfidf",
                        ["nmf", {"n_components": 30}]
                    ]
                }
            ]
        },
        "svd"
    ]
}

# Model: Compound
emb = TextWiser(Embedding.Compound(schema=example_schema))
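As a usage sketch, the model can then be fit on raw documents via TextWiser's scikit-learn-style fit/transform interface; the document strings below are placeholders.

# Fit the compound embedding on a few documents and vectorize them.
documents = ["Some document.", "Another document.", "More text."]
vecs = emb.fit_transform(documents)
print(vecs.shape)  # (3, width of the final SVD output)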
See the usage example for a runnable notebook demonstrating compound embeddings.