Distributional and frequency effects in word embeddings: Recapitulation and next steps

© 2018 Chris Culy, June 2018

chrisculy.net

Overview

This is one of a series of posts. The key results are that frequency effects are pervasive in word embeddings, and that the effects vary by embedding method and to a certain extent by corpus.

Recapitulation

Word embeddings are meant to encode meaning without regard to the frequencies of the words. The underlying assumption is that word meaning is independent of word frequency — the meaning of 'dog' does not depend on how often it is used. (Of course, word frequency can be relevant to other aspects of meaning, such as semantic change.)

However, what we have seen in these posts is that frequency effects are pervasive in word embeddings, across methods and corpus sizes. In other words, word embeddings do not encode meaning without regard to the frequencies of words. Here are some of the frequency effects from the previous posts:

  • Vectors encode frequency information for all but low frequency words
    • individual dimensions encode little information
  • The distributions of word similarities have a positive mean
  • Frequency is related to similarity (see the sketch after this list)
  • Frequency is related to rank
  • Frequency is related to "reciprocity"
  • Hubs show frequency effects
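
To make one of these effects concrete, here is a minimal sketch of a check for the "frequency is related to similarity" point: it correlates each word's corpus frequency with its mean cosine similarity to all other words. The gensim calls are real (gensim 4.x), but the toy `sentences` corpus and the training settings are placeholders, not the exact setup used in these posts.

```python
import numpy as np
from gensim.models import Word2Vec
from scipy.stats import spearmanr

# Placeholder corpus: substitute any tokenized corpus here.
sentences = [["the", "dog", "barks"], ["the", "cat", "sleeps"]]
model = Word2Vec(sentences, vector_size=100, sg=1, negative=5, min_count=1)

# Unit-normalize the vectors so dot products are cosine similarities.
vecs = model.wv.get_normed_vectors()
freqs = np.array([model.wv.get_vecattr(w, "count") for w in model.wv.index_to_key])

# Mean similarity of each word to every other word (excluding itself).
sims = vecs @ vecs.T
mean_sim = (sims.sum(axis=1) - 1.0) / (len(vecs) - 1)

rho, p = spearmanr(freqs, mean_sim)
print(f"Spearman correlation of frequency with mean similarity: {rho:.3f} (p={p:.3g})")
```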

Another set of results has to do with the embedding methods:

  • Skip-gram with negative sampling (sgns) and FastText (ft) often show similar effects
  • Glove and positive pointwise mutual information (ppmi) often show similar effects
  • sgns and ft often show different effects from glove and ppmi

Since particular frequency effects vary by embedding method, a given frequency effect is not due to language, or the corpus, alone, but is due in large part to the embedding method. Sgns and ft are algorithmically similar in trying to optimize predictions, while glove and ppmi are similar in working more directly with counts. These facts make it tempting to attribute the difference in frequency effects to the difference between the two general approaches. On the other hand, Levy and Goldberg [1] have shown the mathematical similarity of sgns and ppmi, which makes the differences harder to understand.
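
To make the connection in [1] concrete: the matrix that sgns implicitly factorizes has cells PMI(w, c) - log k, where k is the number of negative samples, and ppmi keeps only the positive part of PMI. Here is a minimal sketch of that shifted, truncated matrix; the toy `counts` matrix is a placeholder, and with k=1 the result reduces to ordinary ppmi.

```python
import numpy as np

def shifted_ppmi(counts, k=1):
    """Shifted positive PMI: max(PMI(w, c) - log k, 0), following [1]."""
    total = counts.sum()
    p_wc = counts / total                              # joint probabilities
    p_w = counts.sum(axis=1, keepdims=True) / total    # word marginals
    p_c = counts.sum(axis=0, keepdims=True) / total    # context marginals
    with np.errstate(divide="ignore"):                 # log(0) -> -inf is fine here
        pmi = np.log(p_wc) - np.log(p_w) - np.log(p_c)
    return np.maximum(pmi - np.log(k), 0.0)            # truncate at zero

# Toy co-occurrence counts: rows are words, columns are contexts.
counts = np.array([[10.0, 2.0, 0.0],
                   [3.0, 5.0, 1.0]])
print(shifted_ppmi(counts, k=5))
```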

We have also seen that some frequency effects vary by the corpus (shifted similarities and hubs). Since I have used only two corpora, we cannot say for sure whether this variation is due to corpus size (though that seems likely) or to something else. Similarly, the "power law" of nearest neighbors did not hold for the methods and corpora used here, even though [2] showed it for a large corpus using a different embedding method; we cannot say whether the power law is due to the larger size of the corpus in [2] or to the method.
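
For reference, here is a minimal sketch of the kind of check behind the power-law claim in [2]: count how often each word occurs as some other word's nearest neighbor, then inspect the rank-count curve on log-log axes, where a power law would look roughly linear. It assumes `vecs` is a unit-normalized (n_words, dim) array, as in the earlier sketch, and a vocabulary small enough for a full similarity matrix.

```python
import numpy as np
import matplotlib.pyplot as plt

def nearest_neighbor_counts(vecs):
    """For each word, count how many other words have it as their nearest neighbor."""
    sims = vecs @ vecs.T
    np.fill_diagonal(sims, -np.inf)   # a word is not its own neighbor
    nn = sims.argmax(axis=1)          # index of each word's nearest neighbor
    return np.bincount(nn, minlength=len(vecs))

counts = nearest_neighbor_counts(vecs)
ranked = np.sort(counts)[::-1]
ranked = ranked[ranked > 0]           # log-log axes need positive counts
plt.loglog(np.arange(1, len(ranked) + 1), ranked, marker=".")
plt.xlabel("rank")
plt.ylabel("times word occurs as a nearest neighbor")
plt.show()
```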

Next steps

Given that frequency effects seem to be pervasive in word embeddings, one question is whether they affect the uses of word embeddings. To determine that, we would need ways to mitigate the frequency effects. In upcoming posts, I will propose a framework for classifying types of mitigation strategies, both those that have been proposed in the literature and some new ones. I will also show that these mitigations may have positive effects, improving some evaluation scores.
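
As one concrete illustration of what a mitigation can look like (an example from the literature, not one of the proposals in the upcoming posts), here is a sketch of a common post-processing idea: remove the common mean and the top principal components of the vectors, which tend to carry frequency information. It assumes `vecs` is an (n_words, dim) NumPy array.

```python
import numpy as np

def remove_top_components(vecs, n_components=2):
    """Center the vectors, then project out their first few principal directions."""
    centered = vecs - vecs.mean(axis=0)
    # Principal directions via SVD of the centered matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[:n_components]                       # (n_components, dim)
    return centered - centered @ top.T @ top      # project out the top directions

cleaned = remove_top_components(vecs, n_components=2)
```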

A second question is whether the frequency effects discussed here occur with other methods and other corpora. Furthermore, all the frequency effects so far have been observed in corpora that are largely in English. (Vanity Fair does have some French, and the large corpora used for testing in the literature are also not completely monolingual.) Given this, I plan to look at other methods, more corpora, and more languages in the future.

Back to the overview

References

[1] Omer Levy and Yoav Goldberg. 2014. Neural Word Embedding as Implicit Matrix Factorization. In Advances in Neural Information Processing Systems 27:2177-2185.

[2] Tobias Schnabel, Igor Labutov, David Mimno, and Thorsten Joachims. 2015. Evaluation methods for unsupervised word embeddings. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 298-307.
