Most of the word vector approaches (e.g. word2vec, FastText, GloVe) will give different results when trained multiple times on the same corpus with the same parameter settings. This is due to the random sampling done as part of the approach. For word2vec and related approaches, the random sampling can occur in at least 3 places:
With high frequency items in large corpora, the difference in results across runs (by "runs" I'll mean runs with the same parameter settings for the rest of this post) may well be negligible or even non-existent. However, for lower frequency items, and for small corpora, the differences can be striking. We might say that these approaches are unstable.
Here's an example, where we use Vanity Fair as our corpus, with just under 311,000 words. We'll look for the 5 words that are the most similar to man (high frequency), woman, pen (medium frequency), and fox (low frequency), and we'll compare them across 5 runs of word2vec with identical parameters.
# imports
from gensim import models
from gensim.models.fasttext import FastText
from scipy.stats import spearmanr

# for tables in Jupyter
from IPython.display import HTML, display
import tabulate
# read sentences
fname = 'vanity_fair_pg599.txt-sents-clean.txt'
with open(fname) as f:
    sents = [line.strip().split() for line in f.readlines()]
print('%s: %d sentences' % (fname, len(sents)))
vanity_fair_pg599.txt-sents-clean.txt: 13689 sentences
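%%bash
# Info about the text and the target words
# Note: grep -c counts the lines that contain the string anywhere, so "pen"
# also matches e.g. "open" and "happen"; these numbers are line counts for
# the substring, not whole-word token counts (grep -cw would match whole words).
f=vanity_fair_pg599.txt-sents-clean.txt
wc $f
w=man
echo "$w:"
grep -c $w $f
w=woman
echo "$w:"
grep -c $w $f
w=pen
echo "$w:"
grep -c $w $f
w=fox
echo "$w:"
grep -c $w $f
w=happy  # used later
echo "$w:"
grep -c $w $f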
13689 310721 1671293 vanity_fair_pg599.txt-sents-clean.txt
man:
1799
woman:
346
pen:
415
fox:
12
happy:
162
# make some word2vec models
num_runs = 5
sg = 1  # skip-gram
# `size` is the gensim 3.x parameter name (gensim 4 renamed it to vector_size)
(min_count, window, size, workers, downsample) = (2, 5, 20, 2, 0.001)
wmodels = [models.Word2Vec(sents, sg=sg, min_count=min_count, window=window,
                           sample=downsample, size=size, workers=workers)
           for i in range(num_runs)]
def compare_tops(vmodels, item, topn=5):
    """Show the topn most similar words to item in each model."""
    tops = []
    for i, m in enumerate(vmodels):
        tops.append(["Model %d" % i] +
                    ["<b>%s</b> %0.4f" % x
                     for x in m.wv.similar_by_word(item, topn=topn, restrict_vocab=None)])
    display(HTML(tabulate.tabulate(tops, tablefmt='html',
                                   headers=[item] + list(range(1, topn + 1)))))

def compare_all(vmodels, item):
    """Show Spearman rho between models for item's similarity to every word."""
    # topn=False returns the similarity of item to each word in the vocabulary
    sims = [m.wv.similar_by_word(item, topn=False) for m in vmodels]
    rhos = []
    for i, sim in enumerate(sims):
        row = ["<b>Model %d</b>" % (i + 1)]
        for j in range(0, i + 1):
            # spearmanr returns (rho, p-value); we only show rho
            row.append("%0.4f" % spearmanr(sims[i], sims[j])[0])
        rhos.append(row)
    display(HTML(tabulate.tabulate(rhos, tablefmt='html',
                                   headers=[item] + ["Model %d" % i
                                                     for i in range(1, len(vmodels) + 1)])))
topn = 5
wd = 'man'
display(HTML("<h4>%d most similar words</h4>" % topn))
compare_tops(wmodels, wd, topn=topn)
display(HTML("<h4>Spearman rho correlation for ranking of all words compared to %s</h4>" % wd))
compare_all(wmodels, wd)
|Model 0||gentleman 0.8581||devil 0.8446||character 0.8424||fellow 0.8402||nobleman 0.8380|
|Model 1||gentleman 0.8539||nobleman 0.8519||devil 0.8506||character 0.8501||sense 0.8466|
|Model 2||gentleman 0.8537||devil 0.8529||character 0.8503||nobleman 0.8491||true 0.8455|
|Model 3||devil 0.8532||gentleman 0.8511||character 0.8481||nobleman 0.8474||sense 0.8461|
|Model 4||devil 0.8511||gentleman 0.8510||character 0.8497||nobleman 0.8474||true 0.8435|
|man||Model 1||Model 2||Model 3||Model 4||Model 5|
topn = 5
wd = 'woman'
display(HTML("<h4>%d most similar words</h4>" % topn))
compare_tops(wmodels, wd, topn=topn)
display(HTML("<h4>Spearman rho correlation for ranking of all words compared to %s</h4>" % wd))
compare_all(wmodels, wd)
|Model 0||girl 0.9211||heart 0.8999||creature 0.8954||soul 0.8808||simple 0.8760|
|Model 1||girl 0.9217||creature 0.8988||heart 0.8985||soul 0.8818||simple 0.8813|
|Model 2||girl 0.9217||heart 0.8981||creature 0.8962||soul 0.8816||simple 0.8793|
|Model 3||girl 0.9221||heart 0.8983||creature 0.8957||soul 0.8819||simple 0.8816|
|Model 4||girl 0.9221||heart 0.9034||creature 0.8990||soul 0.8879||simple 0.8848|
|woman||Model 1||Model 2||Model 3||Model 4||Model 5|
topn = 5
wd = 'pen'
display(HTML("<h4>%d most similar words</h4>" % topn))
compare_tops(wmodels, wd, topn=topn)
display(HTML("<h4>Spearman rho correlation for ranking of all words compared to %s</h4>" % wd))
compare_all(wmodels, wd)
|Model 0||permission 0.9862||villain 0.9861||console 0.9854||conscience 0.9813||strength 0.9810|
|Model 1||permission 0.9867||console 0.9854||villain 0.9853||seek 0.9814||conscience 0.9810|
|Model 2||permission 0.9859||villain 0.9859||console 0.9849||conscience 0.9823||seek 0.9809|
|Model 3||permission 0.9861||villain 0.9857||console 0.9856||conscience 0.9815||seek 0.9807|
|Model 4||villain 0.9859||permission 0.9858||console 0.9848||conscience 0.9815||seek 0.9805|
|pen||Model 1||Model 2||Model 3||Model 4||Model 5|
topn = 5
wd = 'fox'
display(HTML("<h4>%d most similar words</h4>" % topn))
compare_tops(wmodels, wd, topn=topn)
display(HTML("<h4>Spearman rho correlation for ranking of all words compared to %s</h4>" % wd))
compare_all(wmodels, wd)
|Model 0||reconcilement 0.9888||pew 0.9887||milor 0.9879||balance 0.9878||mild 0.9875|
|Model 1||reconcilement 0.9890||pew 0.9884||observation 0.9879||diplomatist 0.9878||evidently 0.9876|
|Model 2||reconcilement 0.9891||pew 0.9884||balance 0.9882||observation 0.9881||diplomatist 0.9880|
|Model 3||reconcilement 0.9884||pew 0.9883||balance 0.9873||diplomatist 0.9872||evidently 0.9871|
|Model 4||reconcilement 0.9887||pew 0.9882||milor 0.9873||balance 0.9873||observation 0.9872|
|fox||Model 1||Model 2||Model 3||Model 4||Model 5|
From these examples, we can see that although different runs of word2vec with the same parameters are similar (Spearman rho > 0.99), they are not identical. Not only do the similarity numbers differ, but even the relative ranking of the items may differ from one run to the next. For pen, even the most similar word differs across runs (permission in four runs, villain in the fifth). These differences are problematic when we are trying to get an idea of how particular words are used in a small corpus: why should we choose one run over another?
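One quick way to put a number on this is to measure how much the top-5 neighbour lists overlap between two runs. The sketch below uses only the standard library and hard-codes the Model 0 and Model 1 neighbours of fox and woman from the tables above; the helper names `jaccard` and `same_order` are mine, not part of gensim.

```python
# Top-5 neighbours copied from the tables above (Model 0 and Model 1).
fox_runs = [
    ["reconcilement", "pew", "milor", "balance", "mild"],
    ["reconcilement", "pew", "observation", "diplomatist", "evidently"],
]
woman_runs = [
    ["girl", "heart", "creature", "soul", "simple"],
    ["girl", "creature", "heart", "soul", "simple"],
]

def jaccard(a, b):
    """Size of the intersection divided by size of the union of two word lists."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def same_order(a, b):
    """True only if both lists contain the same words in the same order."""
    return list(a) == list(b)

print("fox   overlap:", jaccard(*fox_runs))                             # 0.25
print("woman overlap:", jaccard(*woman_runs), same_order(*woman_runs))  # 1.0 False
```

For the low frequency word fox, the two runs share only 2 of the 8 distinct neighbours, while for woman the two runs agree on the set of neighbours but not on their order.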
There are a few different ways we can avoid the instability of word2vec et al. One way is to use an approach that does not use randomization, e.g. the ppmi_svd approach, which we will return to in the next post. Another way is to fix a random seed, so that even though randomness is used, it is the same randomness every time we run the model. Yet another way is to eliminate (as much as possible) the use of randomness in an approach, e.g. by not downsampling, not using negative sampling, etc. (See [1] for much more detailed discussion.) However, these "tricks" defeat the purpose of using randomness in the first place, and are thus unsatisfying.
However, we can use the small size of the corpora to our advantage. We can come up with a "consensus" model by taking the average of a number of models, i.e., finding the centroid of the models. We can either use a fixed number of models, or we can iteratively update the centroid model by setting a threshold for similarity across iterations.
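To make the "fix a random seed" option concrete, here is a toy sketch of the frequent-word downsampling step using only the standard library. The keep-probability formula is the one I understand the original word2vec C code to use, so treat it as an assumption; `keep_probability` and `downsample` are hypothetical helper names, and the corpus is made up. (In gensim itself, as far as I know, the equivalent is passing `seed=` to the model and training with `workers=1`, since thread scheduling adds its own jitter.)

```python
import math
import random
from collections import Counter

def keep_probability(count, total, sample=0.001):
    """Chance of keeping one occurrence of a word (assumed word2vec-style formula)."""
    freq = count / total
    return min(1.0, (math.sqrt(freq / sample) + 1) * sample / freq)

def downsample(tokens, seed, sample=0.001):
    """Randomly drop frequent tokens, using a private generator seeded with `seed`."""
    rng = random.Random(seed)
    counts = Counter(tokens)
    total = len(tokens)
    return [t for t in tokens
            if rng.random() < keep_probability(counts[t], total, sample)]

# Made-up corpus in which every word is frequent enough to be downsampled.
tokens = ("the fox " * 200 + "the pen " * 300 + "the man " * 500).split()

run_a = downsample(tokens, seed=1)
run_b = downsample(tokens, seed=1)   # same seed: identical training input
run_c = downsample(tokens, seed=2)   # different seed: a different sample

print(run_a == run_b)
print(run_a == run_c)
print(len(tokens), len(run_a), len(run_c))
```

With the same seed the model sees exactly the same tokens on every run, which is what makes the results repeatable, but it is still just one arbitrary sample out of many possible ones.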
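The iterative update relies on a running mean: if the current centroid already averages n models, folding in one more gives (n*centroid + new)/(n+1). Here is a minimal sketch with plain lists standing in for a single word's vector; the three vectors are made up and `update_mean` is a hypothetical helper, not a gensim function.

```python
def update_mean(mean, new, n):
    """Fold `new` into `mean`, where `mean` already averages n vectors."""
    return [(n * m + x) / (n + 1) for m, x in zip(mean, new)]

# Three made-up 2-d vectors for the same word, one from each run.
runs = [[1.0, 4.0], [3.0, 0.0], [5.0, 2.0]]

mean = runs[0]
for n, vec in enumerate(runs[1:], start=1):
    mean = update_mean(mean, vec, n)

# The running mean ends up equal to the ordinary mean over all runs.
batch = [sum(col) / len(runs) for col in zip(*runs)]
print(mean, batch)   # [3.0, 2.0] [3.0, 2.0]
```

Note that n counts the models already in the mean, so the first update uses n = 1. The advantage of the running form is that we never need to keep more than two models in memory.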
In the example below, we will iteratively update a centroid model as follows:
def compare2models(m1, m2, wd, n=10):
    """ compute spearman r for the n closest words to wd """
    if wd not in m1.wv.vocab or wd not in m2.wv.vocab:
        return (float("NaN"), float("NaN"))  # oov
    tops1 = m1.wv.similar_by_word(wd, topn=n, restrict_vocab=None)
    tops2 = m2.wv.similar_by_word(wd, topn=n, restrict_vocab=None)
    # similar_by_word returns (word, similarity) pairs; compare the words only
    # need intersection since there might be differences in vocab size
    tops_both = set(x[0] for x in tops1) & set(x[0] for x in tops2)
    ranks1 = [m1.wv.rank(wd, w) for w in tops_both]
    ranks2 = [m2.wv.rank(wd, w) for w in tops_both]
    return spearmanr(ranks1, ranks2)

def compare_models_words(m1, m2, wds, n=10):
    """ compute spearman r for the n closest words to each wd in wds """
    return [(wd, compare2models(m1, m2, wd, n)) for wd in wds]

def update_centroid(m1, m2, n):
    """ return m2 *modified* to be the weighted average (n*m1 + m2)/(n+1),
    where m1 already averages n models
    i.e. we're using this to iteratively update an average """
    m2.wv.vectors = (n*m1.wv.vectors + m2.wv.vectors)/(n+1)
    m2.init_sims()
    return m2

def iterate_centroid(sents, wds, params=(2, 5, 20, 2, 0.001), sg=1, n=10, runs=5,
                     threshold=0.99, method="word2vec", show_progress=True):
    """ iterate runs with same parameters on words with n most similar
    params is: (min_count,window,size,workers,sample) = (2,5,20,2,0.001)
    after each run, average with previous iteration to create modified model
    threshold is the spearman rho that each of the wds must exceed
    between successive centroids
    method is either "word2vec" or "FastText"
    return a tuple of: the centroid model, whether we converged, and the first model """
    # compare strings with ==, not `is` (identity of string objects is not guaranteed)
    if method is None or method == "word2vec":
        modeler = models.Word2Vec
    elif method == "FastText":
        modeler = FastText
    else:
        raise ValueError("Unknown method: %s" % method)
    (min_count, window, size, workers, sample) = params
    firstm = modeler(sents, sg=sg, min_count=min_count, window=window,
                     sample=sample, size=size, workers=workers)
    prevm = firstm
    for i in range(runs):
        currm = modeler(sents, sg=sg, min_count=min_count, window=window,
                        sample=sample, size=size, workers=workers)
        # prevm already averages i+1 models (firstm plus i earlier runs);
        # passing i here would give firstm a weight of 0 on the first update
        # currm *must* be second arg, since it gets modified by update_centroid
        newm = update_centroid(prevm, currm, i + 1)
        total = 0
        met_thresh = True
        for (wd, (rho, pval)) in compare_models_words(m1=prevm, m2=newm, wds=wds, n=n):
            total += rho
            met_thresh = met_thresh and (rho > threshold)
        if show_progress:
            print("Mean rho:\t%0.20f" % (total/len(wds)))
        if met_thresh:
            break
        prevm = newm
    return (newm, met_thresh, firstm)

def show_iterated_centroid(sents, wds, n=10, runs=40, threshold=0.99, method="word2vec"):
    """ show the results of constructing an iterated centroid using wds for the convergence
    return the centroid, or the first model if it didn't converge """
    (m, OK, firstm) = iterate_centroid(sents, wds, n=n, runs=runs,
                                       threshold=threshold, method=method)
    if OK:
        print("\n%s centroid model" % method)
    else:
        print("Didn't get to each spearman rho of %0.6f" % threshold)
        print("\n%s first model" % method)
        m = firstm
    vecs = m.wv
    for w in wds:
        print(w)
        if w not in vecs.vocab:
            print("\tOOV")
            continue
        for x in vecs.similar_by_word(w, topn=n, restrict_vocab=None):
            print("\t%s\t%f" % x)
        print()
    return m
words = ["man", "woman", "pen", "fox", "sit", "about"]
word2 = "happy"
n = 5
nruns = 40
thresh = 0.99
method = "word2vec"
m = show_iterated_centroid(sents, words, n=n, runs=nruns, threshold=thresh, method=method)

# save it for future use
fnamev = fname + "-" + method + ".vecs"
m.wv.save(fnamev)

# we'll make another one for future use as well
(m2, _, _) = iterate_centroid(sents, words, params=(10, 10, 100, 2, 0.001), sg=1, n=10,
                              runs=5, threshold=0.99, method="word2vec", show_progress=False)
fnamev2 = fname + "-" + method + "-win10-dim100-thresh10.vecs"
m2.wv.save(fnamev2)

print("New word:", word2)
for x in m.wv.similar_by_word(word2, topn=n, restrict_vocab=None):
    print("\t%s\t%f" % x)
Mean rho:    0.75000000000000000000
Mean rho:    0.93333333333333312609
Mean rho:    0.96666666666666645202
Mean rho:    0.98333333333333328152
Mean rho:    0.96666666666666667407
Mean rho:    0.96666666666666645202
Mean rho:    0.99999999999999988898

word2vec centroid model
man
    devil    0.853418
    gentleman    0.852735
    character    0.850795
    nobleman    0.847673
    sense    0.846414

woman
    girl    0.921476
    heart    0.902369
    creature    0.897813
    soul    0.883578
    simple    0.880575

pen
    permission    0.986083
    villain    0.985740
    console    0.985627
    seek    0.981510
    conscience    0.981340

fox
    pew    0.988788
    reconcilement    0.988715
    balance    0.987476
    observation    0.987269
    diplomatist    0.987256

sit
    dine    0.968200
    fetch    0.967894
    carry    0.957296
    ride    0.954534
    wait    0.953989

about
    companions    0.863566
    kindly    0.860787
    doing    0.857813
    talking    0.856786
    whispered    0.855696

New word: happy
    quiet    0.955341
    possible    0.952116
    thinking    0.948815
    pleasant    0.948248
    thoughts    0.939583
method = "FastText"
m = show_iterated_centroid(sents, words, n=n, runs=nruns, threshold=thresh, method=method)

# save it for future use
fnamev = fname + "-" + method + ".vecs"
m.wv.save(fnamev)

print("New word:", word2)
for x in m.wv.similar_by_word(word2, topn=n, restrict_vocab=None):
    print("\t%s\t%f" % x)
Mean rho:    0.94999999999999984457
Mean rho:    0.99999999999999988898

FastText centroid model
man
    woman    0.941212
    madman    0.937547
    human    0.931187
    nobleman    0.928819
    irishwoman    0.921156

woman
    womanhood    0.956260
    human    0.953708
    womankind    0.951916
    kinswoman    0.951080
    gentlewoman    0.949496

pen
    pays    0.986974
    risen    0.986107
    hasten    0.985363
    beaten    0.984248
    forsaken    0.983837

fox
    naivete    0.989489
    fowl    0.989349
    combat    0.987885
    desks    0.987567
    cot    0.987088

sit
    wait    0.982080
    drop    0.978920
    sell    0.972062
    stanhope    0.970823
    run    0.968761

about
    abode    0.907996
    overhear    0.902718
    tallow    0.902074
    jabotiere    0.901981
    howl    0.899920

New word: happy
    unhappy    0.978972
    possibly    0.970741
    cleverly    0.969068
    far    0.968470
    probably    0.967524
The differences between word2vec and FastText are quite striking, but they are not surprising, given that FastText is designed to find similarities among morphologically related words (e.g. happy and unhappy). Another thing to note is that in informal testing, FastText seems to converge to the desired threshold for the centroid in fewer iterations than word2vec. However, since FastText takes longer to run, there is no clear-cut speed advantage.
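To see why happy and unhappy end up so close in FastText, it helps to look at the character n-grams the two words share. The sketch below is a simplification under my own assumptions: it uses n-gram lengths 3 to 6 with `<` and `>` as word boundary markers (which I believe are the fastText defaults) and leaves out the hashing of n-grams into buckets and the separate vector for the whole word. `char_ngrams` is a hypothetical helper, not a gensim function.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = "<" + word + ">"
    return {wrapped[i:i + n]
            for n in range(minn, maxn + 1)
            for i in range(len(wrapped) - n + 1)}

happy = char_ngrams("happy")
unhappy = char_ngrams("unhappy")
shared = happy & unhappy

print(len(happy), len(unhappy), len(shared))   # 14 22 10
print(sorted(shared))
```

Ten of the 14 n-grams of happy also occur in unhappy, so the two words share most of their subword vectors regardless of how they are used in the text.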
I should note that the convergence procedure used here is not guaranteed to converge, in particular in the case when the generated models are further from the centroid than the threshold. In practice, using uncommon words can lead to non-convergence, but using medium to very common words seems to work well. An alternative to using the closest n words would be to compare the given word(s) to the entire vocabulary. In addition, instead of having a fixed set of comparison words, we could pick a random set of words instead, maybe at each iteration.
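As a rough sketch of the last idea, the comparison words could be drawn at random from the words that are frequent enough to be stable. This uses only the standard library; `pick_comparison_words`, its `min_count` cutoff, and the toy sentences are assumptions of mine for illustration, not something I have tested against the convergence procedure above.

```python
import random
from collections import Counter

def pick_comparison_words(sents, k=6, min_count=50, seed=None):
    """Sample k distinct words that occur at least min_count times in sents."""
    counts = Counter(w for sent in sents for w in sent)
    # sort so that the candidate order does not depend on dict ordering
    candidates = sorted(w for w, c in counts.items() if c >= min_count)
    rng = random.Random(seed)
    return rng.sample(candidates, min(k, len(candidates)))

# Made-up tokenized sentences, in the same list-of-lists shape as sents.
toy_sents = [["the", "man", "saw", "the", "fox"],
             ["the", "woman", "saw", "the", "man"],
             ["a", "fox", "saw", "the", "pen"]]

picked = pick_comparison_words(toy_sents, k=2, min_count=2, seed=0)
print(picked)
```

Calling this once before the loop gives a fixed random set; calling it with a fresh seed inside the loop would give a new set at each iteration.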
To sum up, using the (approximate) centroid model is a way to have the best of both worlds: we can use the random aspects of the word vector approaches and find a model that smooths over the instability of the approaches. Given that it may take several iterations to find the centroid model, this approach may not be feasible for large corpora, but it is definitely feasible for small corpora.
[1] Johannes Hellrich and Udo Hahn. 2016. Bad Company—Neighborhoods in Neural Embedding Spaces Considered Harmful. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 2785–2796, Osaka, Japan, December 11-17 2016.