This post is one of a series extending the earlier posts on frequency effects to more embedding techniques and more corpora. In the previous series we saw a wide range of distributional and frequency effects in word embeddings. In this post, I've created some summary tests to make it easier to compare different embeddings. I'll demonstrate these tests by comparing the two types of embeddings that fall under the rubric of word2vec: skip-gram with negative sampling (sgns), which I have been using all along, and the continuous bag of words (cbow).
%load_ext autoreload
%autoreload 2
from dfewe import *
from dfewe_nb.freq_tests import run_tests as testfs
from dfewe_nb.nb_utils import *
#set up standard corpora + vectors
vfair_all = Setup.make_standard_sampler_and_vecs('vfair',5,100,1) #window=5, dims=100, min_count=1
heartd_all = Setup.make_standard_sampler_and_vecs('heartd',5,100,1) #window=5, dims=100, min_count=1
what = [['Vanity Fair (vfair)'],['Heart of Darkness (heartd)']]
for i,c in enumerate([vfair_all,heartd_all]):
    sampler = c['sampler']
    what[i].extend([sum(sampler.counts.values()), len(sampler.counts)])
show_table(what, headers=['Corpus','Tokens','Types'], title="Corpora sizes")
There are many different aspects of distributional and frequency effects that we have seen. Here's a list of the major ones:
Vectors
Dimensions
Shifted similarities
Stratification
Hubs
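To make concrete what one of these frequency tests measures, here is a minimal sketch that correlates word frequency with vector norm. The data is a toy stand-in (Zipf-like frequencies and random vectors with an injected frequency effect), not the dfewe API:

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, dims = 1000, 100

# Toy stand-ins: Zipf-like frequencies and random vectors whose norms
# grow with frequency, mimicking a frequency effect.
freqs = 1.0 / np.arange(1, n_words + 1)
log_f = np.log(freqs)
scale = 1 + 0.1 * (log_f - log_f.min())   # frequent words get larger norms
vecs = rng.normal(size=(n_words, dims)) * scale[:, None]

norms = np.linalg.norm(vecs, axis=1)
r = np.corrcoef(log_f, norms)[0, 1]
print(f"correlation of log-frequency with vector norm: r = {r:.2f}")
```

A strong positive or negative correlation here would count as a frequency effect; near zero would not.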
While the preceding posts relied heavily on visualizations to explore these properties, some of them can be summarised in a table, as we see here for Vanity Fair with sgns.
smplr = vfair_all['sampler']
vname = 'sgns'
vs = vfair_all[vname]
name = 'vfair with %s' % vname
tests = ['vfreq','sksim','stfreq','strank','strecip']
testfs(name,smplr,vs,tests=tests)
We can quickly see that sgns used with Vanity Fair shows strong frequency effects almost across the board. The one exception is stratification of rank, which shows only a weak effect for the reference term.
Other properties really benefit from visualizations, as we see here for the tests for the power law of frequency ranks, the dot product of the vectors with the mean, and potential skewness of the dimension values. Again, the results are easy to read off: a rough power law for nearest neighbors, a frequency effect for the dot products with the mean, and a lack of skewness in the dimension values.
smplr = vfair_all['sampler']
vname = 'sgns'
vs = vfair_all[vname]
name = 'vfair with %s' % vname
tests = ['vpower','dpmean','dims']
testfs(name,smplr,vs,tests=tests)
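For reference, the three checks charted above can each be sketched in a few lines of NumPy. This uses toy data rather than the dfewe tests, which are richer:

```python
import numpy as np

rng = np.random.default_rng(1)
vecs = rng.normal(size=(1000, 100))       # toy embedding matrix

# Power-law check: a power law is a straight line in log-log space, so fit
# a line to log(count) vs log(rank); a slope near -1 is Zipf-like.
ranks = np.arange(1, 1001)
counts = 1000 / ranks                     # perfectly Zipfian toy counts
slope, _ = np.polyfit(np.log(ranks), np.log(counts), 1)

# Dot product with the mean vector: a frequency effect shows up when these
# values trend with word frequency (here they are just noise).
dots = vecs @ vecs.mean(axis=0)

# Skewness of each dimension's values: symmetric dimensions give ~0 skew.
centered = vecs - vecs.mean(axis=0)
skew = (centered ** 3).mean(axis=0) / vecs.std(axis=0) ** 3

print(f"log-log slope: {slope:.2f}, max |dimension skew|: {np.abs(skew).max():.2f}")
```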
Finally, hubs need a combination of table and visualization: the particular hubs in a table and their distribution in a chart.
smplr = vfair_all['sampler']
vname = 'sgns'
vs = vfair_all[vname]
name = 'vfair with %s' % vname
tests = ['hubs','hubp']
testfs(name,smplr,vs,tests=tests)
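For intuition about what the hubs test reports, hubness can be sketched as counting how often each word appears in other words' nearest-neighbour lists. Again this is a toy sketch with random vectors, not the dfewe implementation:

```python
import numpy as np

rng = np.random.default_rng(2)
vecs = rng.normal(size=(500, 50))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)   # unit vectors → cosine

k = 10
sims = vecs @ vecs.T
np.fill_diagonal(sims, -np.inf)                       # exclude self-similarity
knn = np.argsort(-sims, axis=1)[:, :k]                # k nearest neighbours per word

# A word's hubness = how many other words count it among their k neighbours.
hub_counts = np.bincount(knn.ravel(), minlength=len(vecs))
print("max hubness:", hub_counts.max(), " expected under uniformity:", k)
```

Words whose count is far above k are the hubs; the percentile charts then show where those hubs sit in the frequency distribution.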
This suite of tests also makes it easy to compare different embeddings, whether built from different corpora or with different methods. In this section, we'll use the tests to compare the other word2vec method, cbow, with the sgns results shown above.
First, the table.
vfair_all['cbow'] = Setup.make_vecs('cbow', vfair_all['sampler'].sents, 1,5,100,init_sims=True) #window=5, dims=100, min_count=1
heartd_all['cbow'] = Setup.make_vecs('cbow', heartd_all['sampler'].sents, 1,5,100,init_sims=True) #window=5, dims=100, min_count=1
smplr = vfair_all['sampler']
vname = 'cbow'
vs = vfair_all[vname]
name = 'vfair with %s' % vname
tests = ['vfreq','sksim','stfreq','strank','strecip']
testfs(name,smplr,vs,tests=tests)
What we see from the table is that cbow has very different properties from sgns. In particular, it shows only weak to moderate frequency effects, unlike sgns, which showed mostly strong frequency effects.
Next, the charts. The results for cbow are also different from those for sgns: the power law is much less clear, and there is no consistent pattern in the dot products with the mean, unlike with sgns. The only thing in common is the lack of a frequency effect in the dimension values.
smplr = vfair_all['sampler']
vname = 'cbow'
vs = vfair_all[vname]
name = 'vfair with %s' % vname
tests = ['vpower','dpmean','dims']
testfs(name,smplr,vs,tests=tests)
Finally, the hubs. The main thing to notice is that the hubs with cbow are a bit more distributed across the percentiles, though they are still concentrated in the 0-10 percentile band.
smplr = vfair_all['sampler']
vname = 'cbow'
vs = vfair_all[vname]
name = 'vfair with %s' % vname
tests = ['hubs','hubp']
testfs(name,smplr,vs,tests=tests)
Given that cbow is so different from sgns with vfair, let's see how they compare on Heart of Darkness.
First, the tables. For heartd, the two word2vec approaches are similar, both showing only weak to moderate effects, unlike sgns for vfair. Note too that the directions of the correlations differ between vfair and heartd:
| | vfair | heartd |
|---|---|---|
| Frequency | inverse | direct |
| Rank reference | direct | direct |
| Rank comparator | inverse | direct |
| Reciprocity | inverse | inverse |
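The "direct" and "inverse" labels just describe the sign of a rank correlation between frequency and the property in question. A minimal sketch, with hypothetical toy data rather than the dfewe test output:

```python
import numpy as np

def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return np.corrcoef(rx, ry)[0, 1]

freq = np.array([100.0, 50.0, 20.0, 10.0, 5.0])    # toy frequencies
prop_direct = np.array([0.9, 0.7, 0.5, 0.3, 0.1])  # rises with frequency
prop_inverse = prop_direct[::-1]                   # falls with frequency

print(spearman(freq, prop_direct),   # positive → "direct"
      spearman(freq, prop_inverse))  # negative → "inverse"
```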
smplr = heartd_all['sampler']
tests = ['vfreq','sksim','stfreq','strank','strecip']
vname = 'sgns'
name = 'heartd with %s' % vname
vs = heartd_all[vname]
testfs(name,smplr,vs,tests=tests)
vname = 'cbow'
name = 'heartd with %s' % vname
vs = heartd_all[vname]
testfs(name,smplr,vs,tests=tests)
Next up: the charts. Once again, the two approaches are somewhat similar. While both lack a clear power-law relation, the data looks linear with sgns where it does not with cbow.
smplr = heartd_all['sampler']
tests = ['vpower','dpmean','dims']
vname = 'sgns'
name = 'heartd with %s' % vname
vs = heartd_all[vname]
testfs(name,smplr,vs,tests=tests)
vname = 'cbow'
name = 'heartd with %s' % vname
vs = heartd_all[vname]
testfs(name,smplr,vs,tests=tests)
And finally, the hubs, where the approaches are again similar.
smplr = heartd_all['sampler']
tests = ['hubs','hubp']
vname = 'sgns'
name = 'heartd with %s' % vname
vs = heartd_all[vname]
testfs(name,smplr,vs,tests=tests)
vname = 'cbow'
name = 'heartd with %s' % vname
vs = heartd_all[vname]
testfs(name,smplr,vs,tests=tests)
What we have seen is that sgns and cbow differ dramatically on the larger corpus (vfair) but are very similar on the smaller corpus (heartd). There are also differences between vfair and heartd, for example in the trends for the dot product. The summary tests make these differences easy to see.
In the first set of posts we consistently saw differences between vfair and heartd, and we see differences again here with cbow. This raises the question of how much corpus size affects the frequency effects. In the next post, we'll look at some large corpora.