This is one of a series of posts. In this post I look at the skew (in the ordinary, non-statistical sense) of word vector similarities, showing that all four methods (Skip Gram with Negative Sampling (sgns), FastText (ft), Glove, and PPMI) show distorted distributions of similarities. However, the nature of the distortion differs dramatically across methods and across corpus sizes.

- The distributions of word similarities have a positive mean
- sgns and ft have similar distributions
- glove and ppmi have somewhat similar distributions, very different from sgns and ft
- random vectors are effectively not skewed
- eliminating infrequent words
  - makes a big difference for sgns and ft, lowering the mean
  - makes a small difference for glove and ppmi
- corpus size *may* affect distributions, but not always
- very small corpora (like heartd) can be problematic for sgns and ft

In [1]:

```
#imports
from dfewe import *
#for tables in Jupyter
from IPython.display import HTML, display
import tabulate
```

In [2]:

```
# some utilities
def show_title(t):
    # render a bold title in the notebook output
    display(HTML('<b>%s</b>' % t))

def show_table(data,headers,title):
    # render a titled HTML table in the notebook output
    show_title(title)
    display(HTML(tabulate.tabulate(data,tablefmt='html', headers=headers)))
```

As I discussed briefly in the introduction, elementary discussions of word embeddings describe the cosine of the angle between two word vectors as the most common measure of their similarity (I'll return in a later post to other measures), and note that the cosine ranges from -1 to 1. While that is mathematically correct, the *observed* cosine similarities in a corpus do NOT seem to range from -1 to 1, but have a more restricted distribution. Furthermore, that distribution is shifted positively: i.e. a randomly chosen pair of vectors is somewhat more likely to be similar than dissimilar. This shift is unexpected and I have not found any other mention of it.
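
As a minimal sketch of the measure discussed above (not the code used for the plots in this post), the following computes cosine similarity in plain Python and samples it over random pairs of words. The function names and the toy vectors are invented for illustration.

```python
import math
import random

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def sample_similarities(vecs, samples, rng):
    """Cosine similarities for `samples` randomly chosen pairs of distinct words."""
    words = list(vecs)
    sims = []
    for _ in range(samples):
        w1, w2 = rng.sample(words, 2)  # two different words
        sims.append(cosine(vecs[w1], vecs[w2]))
    return sims

# Toy 2-d vectors, hypothetical and for illustration only
toy_vecs = {
    'a': [1.0, 0.0],
    'b': [0.0, 1.0],
    'c': [-1.0, 0.0],
    'd': [1.0, 1.0],
}
sims = sample_similarities(toy_vecs, 10, random.Random(0))
print(min(sims), max(sims))
```

With real word vectors, a histogram or density plot of `sims` is the kind of distribution examined in the rest of this post.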

In most of the examples in these posts, I'll be using two novels as my corpora: *Vanity Fair* by William Makepeace Thackeray, and *The Heart of Darkness* by Joseph Conrad. My motivation, laid out more fully in an earlier post, is that I am interested in how authors use language. In the terms of [1], I take a corpus-centered approach. However, my perspective is slightly different from their characterization in that I am not interested in a book merely as a "sample" of the author's broader language, but rather I am interested in the author's use of language *in that specific context*. In a sense, this is the ultimate corpus-centered approach. Of course, a comparison of contexts could also be interesting, such as across novels, or between fiction and non-fiction, or public versus private writing.

Getting back to the two novels used here, they are both *much* smaller than typical corpora used for constructing word embeddings — vocabularies of thousands rather than hundreds of thousands or more. These small corpora sometimes pose challenges for word embedding methods, as we will see.

In [3]:

```
#set up standard corpora + vectors
vfair_all = Setup.make_standard_sampler_and_vecs('vfair',5,100,1) #window=5, dims=100, min_count=1
heartd_all = Setup.make_standard_sampler_and_vecs('heartd',5,100,1) #window=5, dims=100, min_count=1
what = [['Vanity Fair (vfair)'],['Heart of Darkness (heartd)']]
for i,c in enumerate([vfair_all,heartd_all]):
    sampler = c['sampler']
    # tokens = total count over all words; types = number of distinct words
    what[i].extend([sum(sampler.counts.values()), len(sampler.counts)])
show_table(what, headers=['Corpus','Tokens','Types'], title="Corpora sizes")
```

**The general points and principles hold for word embeddings for large corpora as well**, as I will show in a separate post (and as we already saw in the introduction with the GloVe vectors). In other words, the work here is relevant to *downstream applications* of word embeddings as well.

Let's take a look at (samples from) the distributions of similarities in *Vanity Fair* (aka *vfair*). Each of the four methods is shown in turn.

In [4]:

```
samples = 1000
show_title('Similarity density distributions for vfair, %d samples' % samples)
Plotting.compare_methods_range(vfair_all, samples=samples)
```

The first thing to notice, which was in fact the point of departure, is that all of the distributions have a positive mean, i.e. they are shifted positively. The second thing to notice is that sgns and ft show similar distributions, which are very different from the glove and ppmi distributions, which in turn are somewhat similar to each other. This division between sgns and ft on the one hand and glove and ppmi on the other is one that will show up in almost every example we'll look at. It is *likely* that these differences are due to the predicting vs. counting natures of the methods (sgns and ft are built by optimizing predictions about word co-occurrences, while glove and ppmi are built on the word co-occurrences themselves).
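
To make the "counting" side of that distinction concrete, here is a minimal sketch of the textbook positive pointwise mutual information (PPMI) calculation from raw co-occurrence counts. It is an illustration under stated assumptions, not the implementation used to build the ppmi vectors in this post; the function name and the toy counts are invented.

```python
import math
from collections import Counter

def ppmi(pair_counts):
    """Positive PMI for each (word, context) pair from raw co-occurrence counts."""
    total = sum(pair_counts.values())
    word_totals = Counter()
    context_totals = Counter()
    for (w, c), n in pair_counts.items():
        word_totals[w] += n
        context_totals[c] += n
    scores = {}
    for (w, c), n in pair_counts.items():
        # PMI = log2( P(w, c) / (P(w) * P(c)) ), written in terms of counts
        pmi = math.log2(n * total / (word_totals[w] * context_totals[c]))
        scores[(w, c)] = max(0.0, pmi)  # negative associations are clipped to 0
    return scores

# Hypothetical co-occurrence counts, for illustration only
toy_counts = Counter({('tea', 'cup'): 4, ('tea', 'the'): 2, ('dog', 'the'): 2})
toy_ppmi = ppmi(toy_counts)
for pair, score in toy_ppmi.items():
    print(pair, round(score, 3))
```

Each word's vector is then a row of such scores over contexts, so the vector is a direct function of the counts. A prediction-based method instead adjusts vectors iteratively so that they predict which words co-occur.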

Now let's look at the distributions for *Heart of Darkness* (aka *heartd*).

In [5]:

```
samples = 1000
show_title('Similarity density distributions for heartd, %d samples' % samples)
Plotting.compare_methods_range(heartd_all)
```

*all* the methods for *vfair*, which are unimodal. On the other hand, the other methods show similar distributions across the two corpora sizes.

In [6]:

```
window = 5
dims = 100
min_count = 10
vfair_min10 = Setup.make_standard_sampler_and_vecs('vfair',window,dims,min_count)
heartd_min10 = Setup.make_standard_sampler_and_vecs('heartd',window,dims,min_count)
```

In [7]:

```
show_title('Similarity density distributions for vfair, min_count=%d, %d samples' % (min_count, samples))
Plotting.compare_methods_range(vfair_min10, samples=samples)
```

In [8]:

```
show_title('Similarity density distributions for heartd, min_count=%d, %d samples' % (min_count, samples))
Plotting.compare_methods_range(heartd_min10)
```

Lowering the minimum threshold to 5 for heartd restores sgns but not ft.
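
As a rough illustration of what a minimum-count threshold does to a vocabulary, the sketch below drops every type whose frequency is below the threshold. It assumes the threshold works this way here (as the `min_count` parameter does in common word2vec implementations); `apply_min_count` and the toy text are hypothetical.

```python
from collections import Counter

def apply_min_count(counts, min_count):
    """Keep only the types that occur at least `min_count` times."""
    return Counter({w: n for w, n in counts.items() if n >= min_count})

# Toy text, invented for illustration only
toy_tokens = 'the horror the horror of the river and the darkness of the river'.split()
toy_counts = Counter(toy_tokens)
kept = apply_min_count(toy_counts, 2)

print('types before:', len(toy_counts), 'after:', len(kept))
print('tokens before:', sum(toy_counts.values()), 'after:', sum(kept.values()))
```

In a small corpus a large share of the types are rare, so a threshold like 10 can remove most of the vocabulary while removing a much smaller share of the tokens.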

In [9]:

```
window = 5
dims = 100
min_count = 5
heartd_min5 = Setup.make_standard_sampler_and_vecs('heartd',window,dims,min_count)
show_title('Similarity density distributions for heartd, min_count=%d, %d samples' % (min_count, samples))
Plotting.compare_methods_range(heartd_min5)
```

We might wonder how word vectors compare with random vectors. Here we create random vectors for the vocabulary of vfair, where each value in each vector is chosen uniformly from the range [-1,1]. We show the similarity distributions for 5 repetitions of this process.

In [10]:

```
for _ in range(5):
    # fresh random vectors for the vfair vocabulary on each repetition
    random_vecs = Setup.make_random_vecs(vfair_all['sampler'].counts,100,'vfair-random-100.vecs')
    random_combo = {
        'sampler' : vfair_all['sampler'],
        'random' : random_vecs
    }
    samples = 1000
    show_title('Similarity density distributions for random vfair, %d samples' % samples)
    Plotting.compare_methods_range(random_combo, methods=['random'], samples=samples)
```

Seeing the similarity distributions of the random vectors gives us a better justification for saying that the word vector distributions are distorted. Random vectors have means very close to 0, either positive or negative, variances close to 0, and skewness that is also either positive or negative. We've seen that sgns and ft are extremely shifted positively, which is one kind of distortion. Glove and ppmi are also shifted positively and so distorted in that sense, but they also have only positive skewness (at least in the examples here), unlike random vectors.
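
The random-vector baseline can be approximated without the helper library. The sketch below is a stand-in under stated assumptions, not the code behind the plots above: it draws 100-dimensional vectors uniformly from [-1,1], samples pairwise cosine similarities, and reports their mean, variance and skewness. The vocabulary size of 500 and all names are invented for illustration.

```python
import math
import random
import statistics

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def skewness(xs):
    """Population skewness: the mean of the cubed standardized values."""
    mean = statistics.mean(xs)
    sd = statistics.pstdev(xs)
    return sum(((x - mean) / sd) ** 3 for x in xs) / len(xs)

rng = random.Random(0)
dims, vocab_size, samples = 100, 500, 1000  # vocab_size is an arbitrary choice

# One random vector per (hypothetical) vocabulary item
vecs = [[rng.uniform(-1, 1) for _ in range(dims)] for _ in range(vocab_size)]

sims = []
for _ in range(samples):
    i, j = rng.sample(range(vocab_size), 2)  # two distinct vectors
    sims.append(cosine(vecs[i], vecs[j]))

mean_sim = statistics.mean(sims)
var_sim = statistics.pvariance(sims)
skew_sim = skewness(sims)
print('mean %.4f  variance %.4f  skewness %.4f' % (mean_sim, var_sim, skew_sim))
```

For vectors with independent, zero-centered coordinates the variance of the cosine is roughly 1/dims, which is why it comes out close to 0 at 100 dimensions.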

On the other hand, looking at the random vectors gives partial insight into why we don't see the full range of similarities in word vectors: getting exact similarity or dissimilarity is extremely rare, so it is not surprising that we don't see it in the word vectors (or in the random vectors). However, we are still left with the other distortions unexplained. I'll return to some of these issues later.

I should note that there is a phenomenon whereby "[i]n a high-dimensional space, all pairwise distances between points seem identical" [2], which also shows that the effect starts with dimension as low as 10. However, I don't think that is what is mainly going on here, since we see *different* effects across methods with the same dimensions. The effect might play a small role, but it's not the explanation for the distortions.
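
The concentration effect itself is easy to see in a small simulation. The following sketch is my own illustration, not taken from [2]: it measures the relative contrast, (max - min) / min, of Euclidean distances from one random point to 200 others, for several dimensions. The point counts and dimensions are arbitrary choices.

```python
import math
import random

def relative_contrast(dims, n_points, rng):
    """(max - min) / min of Euclidean distances from one random point to the others."""
    query = [rng.random() for _ in range(dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dims)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)

rng = random.Random(0)
contrasts = {d: relative_contrast(d, 200, rng) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print('dims %4d  relative contrast %.2f' % (d, c))
```

The contrast shrinks as the dimension grows, which is the sense in which pairwise distances "seem identical". Since this depends only on the dimension, it cannot by itself account for differences between methods trained with the same number of dimensions.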

In this look at distributional and frequency effects in word embeddings we've seen that *similarity* distributions are distorted, and that the prediction-based methods (sgns and ft) are more distorted than the count-based methods (glove and ppmi). Looking at random vectors gave a little insight into the lack of a full range of similarities.

Finally, we also saw that the prediction-based methods (sgns and ft) do not work well for very small corpora.

- ➤ Shifted similarity distributions
- Similarity stratification
- Frequency effects of vectors
- Strange geometry
- Distributional effects and hubs
- Recapitulation and next steps

[1] Maria Antoniak and David Mimno. 2018. "Evaluating the Stability of Embedding-based Word Similarities" Transactions of the Association for Computational Linguistics, vol. 6, pp. 107–119.

[2] Damien François, Vincent Wertz, and Michel Verleysen. 2007. IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 7, pp. 873-876.
