Overview
This is part of ongoing work on word embeddings. In the previous series of posts we saw a wide range of distributional and frequency effects in word embeddings. In this series, I will use a set of summary tests to look at another embedding technique (continuous bag of words) as well as embeddings based on large corpora.
TL;DR: Results and Contributions
High level only; more details are in the sections that follow.
Summary tests
- Summary tests for distributional and frequency effects
- sgns and cbow show different frequency effects with Vanity Fair
Large corpora
- Frequency encoding is stronger for the larger corpora than for Vanity Fair
- Frequency stratification tends to be stronger for Vanity Fair than for the larger corpora
- Frequency stratification tends to be direct for the larger corpora and indirect for Vanity Fair
- The power law for nearest neighbors is stronger for the larger corpora than for Vanity Fair
- Similarity skewness is moderate