Word vectors with small corpora:

A new measure for evaluation

© 2018 Chris Culy, March 2018

chrisculy.net

Overview

This is one of a series of posts on using word vectors with small corpora. In this post I propose a simple measure for evaluating word vectors to take into account the limited vocabulary of small corpora.

Download as Jupyter notebook


Background

One way of evaluating word vectors is to apply them to tasks that people do, and compare those results to people's results. Two common tasks are similarity (or relatedness) and analogies. For similarities, the task is to judge how similar (or related) two words are. For analogies, the task is to fill in the missing term in a series "A is to B as C is to —". In this post, I will focus on similarity, but the measure proposed here can be applied to analogies as well, and to other types of evaluation. In the discussion below, following [1], I will use four standard testsets: ws353 [2], ws353_similarity [3], ws353_relatedness [4], and bruni_men [5].

Here are a couple of examples of applying the word vectors to the word similarity task, using the centroid models for Vanity Fair that we created previously.

In [1]:
from gensim import models

#for tables in Jupyter
from IPython.display import HTML, display
import tabulate
In [2]:
vfair_w2v_f = 'vanity_fair_pg599.txt-sents-clean.txt-word2vec.vecs'
vfair_ft_f = 'vanity_fair_pg599.txt-sents-clean.txt-FastText.vecs'


vfair_w2v = models.KeyedVectors.load(vfair_w2v_f)
vfair_ft = models.KeyedVectors.load(vfair_ft_f)

def compare_pairs(vecs,pairs):
    what = [list(p) + [vecs.similarity(p[0],p[1])] for p in pairs]
    #print(what)
    display(HTML(tabulate.tabulate(what, tablefmt='html', headers=['word1','word2','similarity'])))
In [3]:
our_pairs = (('woman','girl'),('woman','man'),('happy','sad'),('happy','unhappy'),('sad','unhappy'))

display(HTML('<b>Similarity of pairs of words, using word2vec centroid model of Vanity Fair</b>'))
compare_pairs(vfair_w2v,our_pairs)
Similarity of pairs of words, using word2vec centroid model of Vanity Fair
word1    word2      similarity
woman    girl         0.910268
woman    man          0.831149
happy    sad          0.895334
happy    unhappy      0.897276
sad      unhappy      0.921609
In [4]:
display(HTML('<b>Similarity of pairs of words, using FastText centroid model of Vanity Fair</b>'))
compare_pairs(vfair_ft,our_pairs)
Similarity of pairs of words, using FastText centroid model of Vanity Fair
word1    word2      similarity
woman    girl         0.918686
woman    man          0.92156
happy    sad          0.872414
happy    unhappy      0.980426
sad      unhappy      0.909073

We can see that there are differences between the word2vec model and the FastText model. For example, word2vec scores girl as being more similar to woman than man is, while FastText is the opposite. Since absolute scores across models are not comparable, what is important is the relative ranking:

  • word2vec for woman: girl > man
  • FastText for woman: man > girl
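We can verify this relative ranking directly by comparing the similarity scores within each model. A minimal check, using the models loaded above:

for name, vecs in [('word2vec', vfair_w2v), ('FastText', vfair_ft)]:
    #only the ordering within a single model is meaningful, not the absolute scores
    closer = 'girl' if vecs.similarity('woman','girl') > vecs.similarity('woman','man') else 'man'
    print("%s: 'woman' is closer to '%s'" % (name, closer))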

There has been considerable work in compiling human similarity judgments, using different criteria, including varying between "similar" (e.g. happy and glad) and "related" (e.g. happy and sad). The result is lists of pairs of words with a score reflecting human judgements of their similarity. The evaluation then consists of using the word vectors to compute similarities (as above) and then comparing the rankings of the humans and the word vectors, typically using the Spearman rho measure of correlation. Here's an example using the standard "ws353.txt" set of 353 word pairs.
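Conceptually, this evaluation computes the model's similarity for each pair whose words are in the vocabulary, and then correlates those scores with the human scores. The sketch below illustrates the idea with scipy, assuming the testset is a tab-separated file of word1, word2, and a human score; the actual evaluations in this post use gensim's built-in evaluate_word_pairs, as shown next.

from scipy.stats import spearmanr

def spearman_vs_humans(vecs, testset_path):
    """Correlate model similarities with human similarity scores,
    skipping pairs with out-of-vocabulary words."""
    human_scores, model_scores = [], []
    with open(testset_path) as f:
        for line in f:
            parts = line.strip().split('\t')  #assumed format: word1 <tab> word2 <tab> score
            if len(parts) != 3:
                continue
            w1, w2, score = parts[0].lower(), parts[1].lower(), float(parts[2])
            if w1 in vecs.vocab and w2 in vecs.vocab:
                human_scores.append(score)
                model_scores.append(vecs.similarity(w1, w2))
    return spearmanr(human_scores, model_scores)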

In [5]:
testsetdir = "testsets/ws/"

def eval_word_pairs_basic(vecs,testset):
    """
    return Spearman rho, recall
    """
    
    pairs = testsetdir + testset
    with open(pairs) as f:
        num_tests = len(f.readlines())
    
    #results is the triple (pearson, spearman, percentage of pairs with out-of-vocabulary words) [from the gensim documentation]
    results = vecs.evaluate_word_pairs(pairs, restrict_vocab=len(vecs.vocab), case_insensitive=True)
    
    rho = results[1][0]
    #results[2] is a percentage, so convert it to the number of pairs actually found
    fnd = num_tests - round(num_tests * results[2]/100)
    
    recall = fnd/num_tests
    
    return(rho, recall)
    
In [6]:
testset = "ws353.txt"

what = [["word2vec"] + list(eval_word_pairs_basic(vfair_w2v,testset)),
       ["FastText"] + list(eval_word_pairs_basic(vfair_ft,testset))]

display(HTML('<b>Correlation with %s similarity testset</b>' % testset))
display(HTML(tabulate.tabulate(what, tablefmt='html', headers=['Spearman rho','recall'])))
Correlation with ws353.txt similarity testset
            Spearman rho     recall
word2vec       -0.047289   0.362606
FastText        0.074689   0.362606

There is a slight negative correlation for word2vec and a slight positive correlation for FastText. However, we can also see that only 36.3% of the word pairs were found in the word vectors for Vanity Fair (the recall). The problem is that word vector testsets are designed to test large corpora of (mainly) contemporary language, while we have a small corpus of 19th century language.

To help better compare word vectors for small corpora, we might like to combine the Spearman rho and recall into a single measure. In other areas of computational linguistics, the F measure is used, which combines precision and recall:

$$F1 = \frac{2 * precision * recall}{precision + recall}$$

I propose a measure analogous to the standard F measure. However, since similarity and relatedness are ranked measures, we don't have precision, but rather (typically) the Spearman $\rho$ measure of correlation. We can use $\rho$ as a proxy for precision, with a slight adjustment: $\rho$ ranges over $[-1,1]$, but for "precision" we need a value in the range $[0,1]$. We can rescale $\rho$ to get a new value $\rho'$ in that range:

$$\rho' = \frac{(1 + \rho)}{2}$$

(Another possibility, not explored here, would be to map all negative Spearman values to 0.) Then we can formulate our new sF1 measure as:

$$sF1 = \frac{2 * \rho' * recall}{\rho' + recall}$$

Of course, we could have the usual family of F scores, but sF1 will suffice here.
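For example, plugging in the word2vec results for ws353.txt from above ($\rho \approx -0.0473$, recall $\approx 0.3626$):

$$\rho' = \frac{1 + (-0.0473)}{2} \approx 0.476 \qquad sF1 = \frac{2 * 0.476 * 0.3626}{0.476 + 0.3626} \approx 0.412$$

This matches the word2vec sF1 value for ws353.txt in the table below.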

We can use the sF1 score to compare results across different sets of word vectors and different test sets.

In [7]:
def eval_word_pairs_sF1(vecs,testset):
    """
    return s-F1, Spearman rho, recall
    """
    
    pairs = testsetdir + testset
    with open(pairs) as f:
        num_tests = len(f.readlines())
    
    #triple (pearson, spearman, ratio of pairs with unknown words). [from documentation]
    results = vecs.evaluate_word_pairs(pairs, restrict_vocab=len(vecs.vocab), case_insensitive=True)
    
    rho = results[1][0]
    fnd = num_tests - round(num_tests * results[2]/100)
    
    recall = fnd/num_tests    
    scorrelation = (1+rho)/2 
    
    sF1 = 2 * scorrelation * recall / (scorrelation + recall)
    
    return (sF1,rho,recall)
In [8]:
what = [["<b>Testset</b>","<b>sF1</b>","<b>Spearman</b>","<b>recall</b>","<b>sF1</b>","<b>Spearman</b>","<b>recall</b>"]]
for ts in ["ws353.txt", "ws353_similarity.txt", "ws353_relatedness.txt", "bruni_men.txt"]:
    r1 = [round(val,5) for val in eval_word_pairs_sF1(vfair_w2v,ts)]
    r2 = [round(val,5) for val in eval_word_pairs_sF1(vfair_ft,ts)]
    what.append([ts] + r1 + r2)

display(HTML('<b>Evaluation of similarity with various testsets using word2vec and FastText</b>'))
display(HTML(tabulate.tabulate(what, tablefmt='html', headers=['','word2vec','','','FastText','','',])))
Evaluation of similarity with various testsets using word2vec and FastText
                           word2vec                          FastText
Testset                    sF1       Spearman   recall       sF1       Spearman   recall
ws353.txt                  0.41177   -0.04729   0.36261      0.43301    0.07469   0.36261
ws353_similarity.txt       0.42673    0.01002   0.36946      0.43539    0.05995   0.36946
ws353_relatedness.txt      0.41134   -0.08313   0.37302      0.43734    0.05694   0.37302
bruni_men.txt              0.49454    0.20521   0.41933      0.49299    0.19609   0.41933

The differences between word2vec and FastText here are not that striking, which is not surprising since the two models use the same parameters. However, when we compare one set of parameters to another, the differences can be stronger.

Here we compare two word2vec models: our original one and a second one with different parameters.

In [9]:
display(HTML('<b>Parameters for two word2vec models</b>'))
display(HTML('<table><tr><th>model</th><th>min_count</th><th>window</th><th>dimensions</th></tr><tr><td>original</td><td>2</td><td>5</td><td>20</td></tr><tr><td>model2</td><td>10</td><td>10</td><td>100</td></tr></table>'))
Parameters for two word2vec models
model       min_count   window   dimensions
original    2           5        20
model2      10          10       100
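For reference, a single word2vec run with model2's parameters could be trained roughly as follows. This is only a sketch: it assumes the cleaned sentences file is vanity_fair_pg599.txt-sents-clean.txt, it does not reproduce the centroid construction from the earlier post, and the size parameter is called vector_size in gensim 4.0 and later.

from gensim.models.word2vec import Word2Vec, LineSentence

#one run with model2's parameters; the centroid models in this series average several such runs
sentences = LineSentence('vanity_fair_pg599.txt-sents-clean.txt')
w2v2 = Word2Vec(sentences, size=100, window=10, min_count=10)
w2v2.wv.save('vanity_fair_pg599.txt-sents-clean.txt-word2vec-win10-dim100-thresh10.vecs')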
In [10]:
vfair_w2v2 = models.KeyedVectors.load('vanity_fair_pg599.txt-sents-clean.txt-word2vec-win10-dim100-thresh10.vecs')

what = [["<b>Testset</b>","<b>sF1</b>","<b>Spearman</b>","<b>recall</b>","<b>sF1</b>","<b>Spearman</b>","<b>recall</b>"]]
for ts in ["ws353.txt", "ws353_similarity.txt", "ws353_relatedness.txt", "bruni_men.txt"]:
    r1 = [round(val,5) for val in eval_word_pairs_sF1(vfair_w2v,ts)]
    r2 = [round(val,5) for val in eval_word_pairs_sF1(vfair_w2v2,ts)]
    what.append([ts] + r1 + r2)

display(HTML('<b>Evaluation of similarity with various testsets with two different word2vec models</b>'))
display(HTML(tabulate.tabulate(what, tablefmt='html', headers=['','original','','','model2','','',])))
Evaluation of similarity with various testsets with two different word2vec models
                           original                          model2
Testset                    sF1       Spearman   recall       sF1       Spearman   recall
ws353.txt                  0.41177   -0.04729   0.36261      0.26883    0.14546   0.17564
ws353_similarity.txt       0.42673    0.01002   0.36946      0.234      0.12348   0.14778
ws353_relatedness.txt      0.41134   -0.08313   0.37302      0.2856     0.14097   0.19048
bruni_men.txt              0.49454    0.20521   0.41933      0.28953    0.45708   0.18067

The testset bruni_men clearly shows the importance of the sF1 measure. Model2 has a much higher Spearman correlation than the original model, but a much lower recall. As a consequence, the sF1 score for model2 is much lower than the sF1 score for the original model.

The other thing to note is that the performance of each model varies widely across the testsets. This type of variation is also seen using large corpora. However, with small corpora, the issue of low recall is more important, so the use of the sF1 score lets us take recall into account directly.

Discussion and Conclusion

The literature and tools for evaluating word vectors on similarity (and on analogies and other tasks) assume that the word vectors will contain nearly all of the words being tested, so the rank correlation can fairly be relied on for comparison. With small corpora, however, this assumption does not hold, and we need another measure, like sF1, to make more useful comparisons.

Back to the introduction

Other posts

References

The testsets are included with the hyperwords package [7], while the base evaluation is done using gensim [6]. sF1 is my own calculation.

[1] Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving Distributional Similarity with Lessons Learned from Word Embeddings. Transactions of the Association for Computational Linguistics, vol. 3, pp. 211–225.

[2] Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2002. Placing search in context: The concept revisited. ACM Transactions on Information Systems, 20(1):116–131.

[3] Torsten Zesch, Christof Müller, and Iryna Gurevych. 2008. Using wiktionary for computing semantic relatedness. In Proceedings of the 23rd National Conference on Artificial Intelligence - Volume 2, AAAI’08, 861–866. AAAI Press.

[4] Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Pasca, and Aitor Soroa. 2009. A study on similarity and relatedness using distributional and wordnet-based approaches. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 19–27, Boulder, Colorado, June. Association for Computational Linguistics.

[5] Elia Bruni, Gemma Boleda, Marco Baroni, and Nam Khanh Tran. 2012. Distributional semantics in technicolor. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 136–145, Jeju Island, Korea, July. Association for Computational Linguistics.

[6] Gensim: https://radimrehurek.com/gensim/, published as: Software Framework for Topic Modelling with Large Corpora. Radim Řehůřek and Petr Sojka, Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45-50, 22 May 2010.

[7] Hyperwords: https://bitbucket.org/omerlevy/hyperwords, published as [1]