Distributional and frequency effects in word embeddings: Distributional effects and hubs

© 2018 Chris Culy, June 2018

chrisculy.net

Overview

This is one of a series of posts. In this post I examine the issue of "hubs": words/vectors that are very similar to many other words/vectors. While the existence of hubs is a mathematical property of (certain) vector spaces, I will explore them with regard to whether the hubs that arise in word embeddings have any patterns that are due either to language or to the methods used for creating the word embeddings. One particular result is that hubs show frequency effects; another is that ppmi is very different from the other 3 methods on the very small corpus, heartd.

Results and contribution

  • new word embeddings usually have hubs
    • the exception is with the very small corpus
    • the exception to the exception is that ppmi does have hubs with the very small corpus
  • new the hubs can vary (slightly) by run of a method
  • new the hubs vary by method
  • new the hubs show frequency effects
    • sgns and ft have mostly/only low frequency hubs
    • glove has mostly higher frequency hubs, with a spike at the very lowest frequencies
    • ppmi has hubs in a range of frequencies, with a spike at the very lowest frequencies
  • new stratification of similarities is not sufficient to explain the hub frequency effects

Download as Jupyter notebook

Download supplemental Python code


In [1]:
#imports
from dfewe import *

#for tables in Jupyter
from IPython.display import HTML, display
import tabulate
In [2]:
# some utilities
def show_title(t):
    display(HTML('<b>%s</b>' % t))

def show_table(data,headers,title):
    show_title(title)
    display(HTML(tabulate.tabulate(data,tablefmt='html', headers=headers)))

#for dynamic links
links = ('<a href="#link%d">Skip down</a>' % i for i in range(100))
anchors = ('<span id="link%d"></span>' % i for i in range(100))

def make_link():
    display(HTML(next(links)))

def make_anchor():
    display(HTML(next(anchors)))
    
In [3]:
#set up standard corpora + vectors
vfair_all = Setup.make_standard_sampler_and_vecs('vfair',5,100,1) #window=5, dims=100, min_count=1
heartd_all = Setup.make_standard_sampler_and_vecs('heartd',5,100,1) #window=5, dims=100, min_count=1

what = [['Vanity Fair (vfair)'],['Heart of Darkness (heartd)']]
for i,c in enumerate([vfair_all,heartd_all]):
    sampler = c['sampler']
    what[i].extend([sum(sampler.counts.values()), len(sampler.counts)])

show_table(what, headers=['Corpus','Tokens','Types'], title="Corpora sizes")
Corpora sizes

Corpus                       Tokens   Types
Vanity Fair (vfair)          310722   15803
Heart of Darkness (heartd)    38897    5420

Background

The notion of hubs comes from work on k-nearest neighbor classification, and in fact "Hubness is a phenomenon related specifically to nearest-neighbor methods." [1] Since nearest neighbors are an important aspect of the use of word vectors (for example in word similarity and analogy evaluations), it's reasonable to consider the properties of hubs in word embeddings. [2] discuss hubs as a potential problem for evaluation and propose a mitigation strategy, which I will return to in a subsequent post (TBD).

The fundamental observation is that some vectors are very similar to many more vectors than most vectors are. These vectors are the hubs.

A key notion is that of $NN_{k}(x)$, which is the number of times vector x is one of the k-nearest neighbors of the other data points. I'll extend that notion to make the word vectors explicit, so that we can explore the possibility of frequency effects.

$NN_{k}(x,Y)$, where x is a word vector and Y is a set of word vectors, is the number of times x is one of the k-nearest neighbors of a word vector in Y.
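To make that concrete, here's a minimal sketch of how $NN_{k}(x,Y)$ could be computed with the gensim-style similar_by_word method used throughout this notebook. (The actual implementation is Hubs.nn_k in the supplemental code; I'm only assuming it does something roughly like this.)

from collections import Counter

def nn_k_sketch(sim_fn, k, words, others):
    """
    sim_fn: a similarity function like vecs.similar_by_word, returning (word, similarity) pairs.
    For each word in words, count how many words in others have it among their k nearest neighbors.
    """
    counts = Counter()
    targets = set(words)
    for y in others:
        for neighbor, _sim in sim_fn(y, topn=k):
            if neighbor in targets:
                counts[neighbor] += 1
    return [(w, counts[w]) for w in words]

Note that this costs one k-nearest-neighbor query per word in others, so it is slow when others is the whole vocabulary.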

Let's see what $NN_{k}$ looks like for some words. (Throughout this post I'll use k=1000; k=1 is small, k=100 is reasonable, and k>1000 doesn't really add anything.)

In [4]:
def nn_k_for_all(k,words,vecs):
    """
    find nn_k for using the whole vocabulary of vecs, using the similarity of vecs
    """

    others = vecs.vocab.keys()
    return Hubs.nn_k(vecs.similar_by_word,k,words,others)

def example1():
    vecs = vfair_all['sgns']
    words = ['carriage','cart','lady','lord','man','woman']
    
    headers = ['Word','NN<sub>1000</sub>']
    d = sorted(nn_k_for_all(1000,words,vecs), key=lambda x: x[0])
    
    show_table(d, headers, '')
    
example1()
Word      NN_1000
carriage      124
cart         1275
lady            8
lord           27
man             5
woman          17

We can see that cart is one of the 1000-nearest neighbors for over 1000 other words while man is one of the 1000-nearest neighbors of fewer than 10 words.

Hubs then are vectors that have a higher than typical $NN_{k}$, like cart in this example.

Since "higher than typical" is a bit vague, it's worth looking at the distribution of the values of $NN_{k}$, and we'll do so by sampling words in percentile bands, and compare them to a sample from all of the vocabulary. (I'm doing the sampling rather than all the comparisons to save time.) Here's the results for the sgns vectors for vfair.

In [5]:
sampler = vfair_all['sampler']
vecs = vfair_all['sgns']
name = 'vfair with sgns'
Hubs.nn_k_by_percentile(sampler,vecs,name,k=1000,max_words=1000,steps=5,words_per_step=2)

The first thing to notice is that there are some words that clearly qualify as hubs. The second thing to notice is the extreme skewing of the distribution by percentile.

Here's a comparison of all 4 methods for vfair. Notice that the (re)sampling changes the $NN_k$ values for sgns somewhat compared to the above, but makes only a slight difference for the hubs, and the overall trend is the same.

However, the main observation is that once again glove and ppmi show very different trends from sgns and ft, with more variation and less clear hubs.

In [6]:
combo = vfair_all
name='vfair'
Hubs.compare_nn_k_by_percentile(combo,name,k=1000,max_words=1000,steps=5,words_per_step=2)




Now let's look at heartd. Unfortunately, it's hard to say much other than that all 4 methods show different trends than they do with vfair. Sampling may well be the culprit here, so we'll look more carefully below.

In [7]:
combo = heartd_all
name='heartd'
Hubs.compare_nn_k_by_percentile(combo,name,k=1000,max_words=1000,steps=5,words_per_step=2)




Hub basics

We still need to operationalize the notion of hub. One thing that is immediately obvious from all of the graphs above is that $NN_k$ is not normally distributed, whatever its distribution is. However, we can still use the standard deviation as a heuristic for finding hubs: we choose a threshold number of standard deviations, and if a word's $NN_k$ is below that threshold, it does not count as a hub. We can then examine the words above the threshold and either take them all or just the highest scoring ones. (As a side note, when the whole vocabulary is compared against itself, the mean of $NN_k$ is exactly k: each of the N words contributes k nearest-neighbor slots, for N*k slots in total, averaged over N values. That is why the tables below all report an overall mean of 1000; the information is in the spread.)
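As a sketch, the heuristic looks like this. (The real Hubs.find_hubs_with_all also tracks percentiles and returns a DataFrame; this simplified version is hypothetical.)

import numpy as np

def hubs_by_std(nnk, thresh=4):
    """
    nnk: dict mapping each word to its NN_k value.
    return (mean, std, hubs) where hubs are the words whose NN_k is at
    least thresh standard deviations above the mean, sorted by NN_k
    """
    vals = np.array(list(nnk.values()), dtype=float)
    m, std = vals.mean(), vals.std()
    hubs = [(w, v, (v - m)/std) for w, v in nnk.items() if v >= m + thresh*std]
    return m, std, sorted(hubs, key=lambda t: -t[1])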

Here are the results for vfair, using 4 standard deviations as the threshold, and showing just the top 10 words.

In [8]:
def show_hubs(sampler,vecs,name,k=1000,threshold=4,topn=10):
    """
    show the hubs with at k_nn of at least threshold standard deviations above the mean. 
    if topn is True, show all of them (with that threshold)
    """
    (m,std,df) = Hubs.find_hubs_with_all(sampler,vecs,k=k,thresh=threshold)

    title = "Hubs for %s with k=%d and threshold of %d standard deviations" % (name,k,threshold)
    show_title(title)
    stitle = "Overall mean: %0.4f &nbsp;&nbsp; Overall std: %0.4f" % (m,std)
    show_title(stitle)
    display(HTML(df[:topn].to_html(index=False)))

def compare_hubs(combo,name,k=1000,threshold=4,topn=10):
    """
    do show_hubs for the different methods in combo
    """
    
    Utils.compare_methods(combo,name,show_hubs,k=k,threshold=threshold,topn=topn)
In [9]:
k = 1000
threshold = 4
topn = 10
compare_hubs(vfair_all,'vfair',k=k,threshold=threshold,topn=topn)
Hubs for vfair, using sgns with k=1000 and threshold of 4 standard deviations
Overall mean: 1000.0000    Overall std: 741.5778
word percentile nn_k # stds
rickety 0 4751 5.058134
bumpers 1 4706 4.997453
x 1 4520 4.746636
weighed 1 4435 4.632016
moaning 1 4433 4.629319
bravely 1 4289 4.435138
charmante 0 4278 4.420305
dolly 1 4270 4.409517
backed 1 4266 4.404123
radical 0 4214 4.334003

Hubs for vfair, using ft with k=1000 and threshold of 4 standard deviations
Overall mean: 1000.0000    Overall std: 762.1054
word percentile nn_k # stds
tunic 0 5280 5.616021
defunct 1 5201 5.512361
telegraphic 0 4896 5.112154
spaniel 1 4896 5.112154
energetic 0 4798 4.983563
suburbs 0 4760 4.933701
preliminary 0 4720 4.881215
enthusiasm 3 4715 4.874654
telegraph 0 4692 4.844474
apologise 0 4668 4.812982

Hubs for vfair, using glove with k=1000 and threshold of 4 standard deviations
Overall mean: 1000.0000    Overall std: 734.7789
word percentile nn_k # stds
<unk> 0 6305 7.219859
alge 0 5742 6.453642
bobbins 0 5434 6.034468
westwards 0 5430 6.029024
contrasts 0 5428 6.026302
velvets 0 5355 5.926953
sneaking 0 5353 5.924231
marine 0 5315 5.872515
privateer 0 5293 5.842574
cannibals 0 5284 5.830325

Hubs for vfair, using ppmi with k=1000 and threshold of 4 standard deviations
Overall mean: 1000.0000    Overall std: 191.4769
word percentile nn_k # stds
and 99 2866 9.745302
palatinate 0 2478 7.718948
jackals 0 2358 7.092240
flaring 0 2297 6.773664
legion 0 2275 6.658768
instituted 0 2271 6.637877
mustering 0 2267 6.616987
exploding 0 2256 6.559539
outsides 0 2244 6.496868
enhanced 0 2216 6.350637

Variability of hubs

Variability by run

Various researchers, including [2], [3], and my own earlier post, have found that word similarities vary across runs of algorithms like sgns due to their random aspects. Not surprisingly, the same is true for hubs, since they are based on similarities.

As an example, here's a listing of the top 20 words by $NN_{1000}$ for vfair from three runs of sgns. While there are some differences, they are slight.
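One simple way to quantify "slight" (not something the code below does) is the Jaccard overlap of the top-n hub words from two runs:

def hub_jaccard(df1, df2, topn=20):
    """
    Jaccard overlap of the top-topn hub words from two runs:
    1.0 means identical sets, 0.0 means disjoint sets
    """
    s1 = set(df1[:topn]['word'])
    s2 = set(df2[:topn]['word'])
    return len(s1 & s2) / len(s1 | s2)

Here df1 and df2 would be the DataFrames returned by Hubs.find_hubs_with_all, as in the next cell.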

In [10]:
def compare_runs(topn=20):
    vfair_all2 = Setup.make_standard_sampler_and_vecs('vfair',5,100,1) #window=5, dims=100, min_count=1
    vfair_all3 = Setup.make_standard_sampler_and_vecs('vfair',5,100,1) #window=5, dims=100, min_count=1

    (m1,std1,df1) = Hubs.find_hubs_with_all(vfair_all['sampler'],vfair_all['sgns'])
    (m2,std2,df2) = Hubs.find_hubs_with_all(vfair_all2['sampler'],vfair_all2['sgns'])
    (m3,std3,df3) = Hubs.find_hubs_with_all(vfair_all3['sampler'],vfair_all3['sgns'])

    title = 'Comparison of hubs across runs of sgns with vfair'
    what = '<table><tr><th>run 1</th><th>run 2</th><th>run 3</th></tr>'
    what += '<tr><td>' + df1[:topn].to_html(index=False) + '</td>'
    what += '<td>' + df2[:topn].to_html(index=False) + '</td>'
    what += '<td>' + df3[:topn].to_html(index=False) + '</td></tr></table>'

    show_title(title)
    display(HTML(what))
In [11]:
compare_runs(20)
Comparison of hubs across runs of sgns with vfair
Run 1:
word percentile nn_k # stds
rickety 0 4751 5.058134
bumpers 1 4706 4.997452
x 1 4520 4.746636
weighed 1 4435 4.632015
moaning 1 4433 4.629318
bravely 1 4289 4.435138
charmante 0 4278 4.420304
dolly 1 4270 4.409517
backed 1 4266 4.404123
radical 0 4214 4.334002
gleams 0 4214 4.334002
sympathetic 1 4213 4.332653
approaching 1 4212 4.331305
banners 0 4196 4.309729
cent 1 4162 4.263881
imbibed 0 4127 4.216685
cooks 1 4122 4.209942
ally 0 4106 4.188367
rigidly 0 4093 4.170836
knack 0 4072 4.142518
Run 2:
word percentile nn_k # stds
rickety 0 4727 5.027145
bumpers 1 4615 4.876074
weighed 1 4511 4.735794
x 1 4483 4.698027
moaning 1 4479 4.692631
approaching 1 4373 4.549654
charmante 0 4324 4.483560
dolly 1 4315 4.471421
imbibed 0 4297 4.447142
radical 0 4224 4.348676
bravely 1 4198 4.313606
sympathetic 1 4197 4.312257
barges 0 4185 4.296071
gleams 0 4175 4.282582
knack 0 4159 4.261001
banners 0 4148 4.246164
ally 0 4143 4.239419
backed 1 4139 4.234024
cent 1 4138 4.232675
echoes 1 4115 4.201652
Run 3:
word percentile nn_k # stds
rickety 0 4744 5.047096
bumpers 1 4663 4.937904
moaning 1 4521 4.746481
weighed 1 4505 4.724912
x 1 4452 4.653466
bravely 1 4325 4.482263
approaching 1 4294 4.440474
dolly 1 4284 4.426993
holland 1 4262 4.397336
charmante 0 4254 4.386552
gleams 0 4226 4.348807
sympathetic 1 4216 4.335326
backed 1 4213 4.331282
wander 0 4204 4.319150
imbibed 0 4195 4.307017
cooks 1 4181 4.288144
banners 0 4149 4.245007
contentedly 1 4131 4.220742
cent 1 4131 4.220742
radical 0 4123 4.209958

Variability by method

We can also compare the hubs across the different methods for vfair. What we find is that each method leads to different hubs.

In [12]:
def compare_hub_words(combo,name,k=1000,thresh=4,topn=20):
    """
    compare which words are hubs, among the topn
    """

    sampler = combo['sampler']
    
    vs = ['sgns','ft','glove','ppmi']
    vhubs = dict()

    for v in vs:
        vecs = combo[v]
        (m,std,df) = Hubs.find_hubs_with_all(sampler,vecs,k=k,thresh=thresh)
        if v == 'glove':
            df = df.replace('<unk>', '&lt;unk&gt;') #rename <unk> so it appears in html
        vhubs[v] = set(list(df[:topn]['word']))
        
    show_title("Comparison of top %d hubs for %s, k=%d, threshold=%d stds" % (topn,name,k,thresh))

    maxn = 0
    in_all = None #intersection across all methods; None until the first set is seen
    in_some = set()
    for h in vhubs.values():
        in_some |= h
        in_all = set(h) if in_all is None else (in_all & h) #NB: starting from set() would always give an empty intersection
        maxn = max(maxn,len(h))
        
    combon = len(in_all)
    
    
    show_title('Distinct hubs: %d Combined overlap: %d' % (len(in_some), combon))
    
    if combon == 0:
        display(HTML('<em>There are no hubs in common among the top %d hubs for each method</em>' % maxn))    
    else:
        if combon == maxn:
            display(HTML('<em>There is complete overlap among the top %d hubs for each method</em>' % maxn))    
        display(HTML('<p>' + '<br>'.join(sorted(list(in_all))) + '</p>'))
            
    if combon < maxn:
        #pairwise comparison
        for v1 in vhubs:
            for v2 in vhubs:
                if v1==v2:
                    break

                what = []
                
                overlap = vhubs[v1] & vhubs[v2]
                first_not_second = vhubs[v1] - vhubs[v2]
                second_not_first = vhubs[v2] - vhubs[v1]
                what += [[', '.join(sorted(list(overlap))),
                         ', '.join(sorted(list(first_not_second))),
                         ', '.join(sorted(list(second_not_first)))]]

                headers = ["%s and %s overlap: %d" % (v1,v2, len(overlap)),
                            "%s but not %s: %d" % (v1,v2, len(first_not_second)),
                            "%s but not %s: %d" % (v2,v1, len(second_not_first))]

                show_table(what,headers,'')
In [13]:
k = 1000
threshold = 4
topn = 50
make_link()
compare_hub_words(vfair_all,'vfair', k=k,thresh=threshold,topn=topn)
make_anchor()
Comparison of top 50 hubs for vfair, k=1000, threshold=4 stds
Distinct hubs: 173 Combined overlap: 0
There are no hubs in common among the top 50 hubs for each method
ft and glove overlap (0): (none)
ft but not glove (50): abstract, absurd, advocacy, antoinette, apologise, apologue, austerlitz, criticisms, culotte, defunct, depict, despotism, ecstacy, emporium, energetic, enthusiasm, envoy, equestrian, extract, extraordinary, falsehoods, flippancy, ida, jeannette, obstacle, opium, palatinate, pantechnicon, philip, pigault, polonaise, preliminary, reglar, reticule, schwartzenberg, skip, spaniel, suburb, suburbs, suitor, telegraph, telegraphic, temporary, temporise, tipsy, tranquille, tropical, truffigny, truncheon, tunic
glove but not ft (50): <unk>, abraham, achaiois, alge, appetens, bei, blameless, bobbins, cannibals, chevaux, contrasts, coupy, d'etre, della, der, dooze, downy, eccolo, etheke, expeditions, hellborough, homer, japan, jungly, lappets, launched, marine, milch, ministre, minois, modes, muswell, palls, pap, perfidious, plaisir, privateer, pumpernickelisch, requiring, sacre, sangviches, satiata, si, sneaking, spatter, spicy, struggled, velvets, vin, westwards

sgns and glove overlap (0): (none)
sgns but not glove (26): ally, approaching, backed, banners, bravely, bumpers, canal, cent, charmante, chattels, contentedly, cooks, dolly, gleams, holland, hugely, imbibed, knack, moaning, radical, rickety, rigidly, sympathetic, tankard, weighed, x
glove but not sgns (50): <unk>, abraham, achaiois, alge, appetens, bei, blameless, bobbins, cannibals, chevaux, contrasts, coupy, d'etre, della, der, dooze, downy, eccolo, etheke, expeditions, hellborough, homer, japan, jungly, lappets, launched, marine, milch, ministre, minois, modes, muswell, palls, pap, perfidious, plaisir, privateer, pumpernickelisch, requiring, sacre, sangviches, satiata, si, sneaking, spatter, spicy, struggled, velvets, vin, westwards

sgns and ft overlap (0): (none)
sgns but not ft (26): ally, approaching, backed, banners, bravely, bumpers, canal, cent, charmante, chattels, contentedly, cooks, dolly, gleams, holland, hugely, imbibed, knack, moaning, radical, rickety, rigidly, sympathetic, tankard, weighed, x
ft but not sgns (50): abstract, absurd, advocacy, antoinette, apologise, apologue, austerlitz, criticisms, culotte, defunct, depict, despotism, ecstacy, emporium, energetic, enthusiasm, envoy, equestrian, extract, extraordinary, falsehoods, flippancy, ida, jeannette, obstacle, opium, palatinate, pantechnicon, philip, pigault, polonaise, preliminary, reglar, reticule, schwartzenberg, skip, spaniel, suburb, suburbs, suitor, telegraph, telegraphic, temporary, temporise, tipsy, tranquille, tropical, truffigny, truncheon, tunic

ppmi and glove overlap (1): hellborough
ppmi but not glove (49): addington, adjustment, alabaster, and, ariadne, authors, bargaining, beagles, botany, cabs, capri, crosses, enhanced, enumerated, essence, expansiveness, exploding, faver, ficci, flagon, flaring, glistened, hampers, heeltap, instituted, jackals, legion, meriting, mustering, outsides, palaces, palatinate, patrons, ponds, punter, rates, richness, shipwrecked, sniveller, sospiri, squalling, syriac, the, timbuctoo, trumperies, truncheon, wandsworth, whirl, yawns
glove but not ppmi (49): <unk>, abraham, achaiois, alge, appetens, bei, blameless, bobbins, cannibals, chevaux, contrasts, coupy, d'etre, della, der, dooze, downy, eccolo, etheke, expeditions, homer, japan, jungly, lappets, launched, marine, milch, ministre, minois, modes, muswell, palls, pap, perfidious, plaisir, privateer, pumpernickelisch, requiring, sacre, sangviches, satiata, si, sneaking, spatter, spicy, struggled, velvets, vin, westwards

ppmi and ft overlap (2): palatinate, truncheon
ppmi but not ft (48): addington, adjustment, alabaster, and, ariadne, authors, bargaining, beagles, botany, cabs, capri, crosses, enhanced, enumerated, essence, expansiveness, exploding, faver, ficci, flagon, flaring, glistened, hampers, heeltap, hellborough, instituted, jackals, legion, meriting, mustering, outsides, palaces, patrons, ponds, punter, rates, richness, shipwrecked, sniveller, sospiri, squalling, syriac, the, timbuctoo, trumperies, wandsworth, whirl, yawns
ft but not ppmi (48): abstract, absurd, advocacy, antoinette, apologise, apologue, austerlitz, criticisms, culotte, defunct, depict, despotism, ecstacy, emporium, energetic, enthusiasm, envoy, equestrian, extract, extraordinary, falsehoods, flippancy, ida, jeannette, obstacle, opium, pantechnicon, philip, pigault, polonaise, preliminary, reglar, reticule, schwartzenberg, skip, spaniel, suburb, suburbs, suitor, telegraph, telegraphic, temporary, temporise, tipsy, tranquille, tropical, truffigny, tunic

ppmi and sgns overlap (0): (none)
ppmi but not sgns (50): addington, adjustment, alabaster, and, ariadne, authors, bargaining, beagles, botany, cabs, capri, crosses, enhanced, enumerated, essence, expansiveness, exploding, faver, ficci, flagon, flaring, glistened, hampers, heeltap, hellborough, instituted, jackals, legion, meriting, mustering, outsides, palaces, palatinate, patrons, ponds, punter, rates, richness, shipwrecked, sniveller, sospiri, squalling, syriac, the, timbuctoo, trumperies, truncheon, wandsworth, whirl, yawns
sgns but not ppmi (26): ally, approaching, backed, banners, bravely, bumpers, canal, cent, charmante, chattels, contentedly, cooks, dolly, gleams, holland, hugely, imbibed, knack, moaning, radical, rickety, rigidly, sympathetic, tankard, weighed, x

Repeating the same comparison with heartd (but using 3 standard deviations as our threshold), we again find very different hubs across the methods.

In [14]:
k = 1000
threshold = 3 #NB 3 instead of 4
topn = 50
make_link()
compare_hub_words(heartd_all,'heartd', k=k,thresh=threshold,topn=topn)
make_anchor()
Comparison of top 50 hubs for heartd, k=1000, threshold=3 stds
Distinct hubs: 61 Combined overlap: 0
There are no hubs in common among the top 35 hubs for each method
ft and glove overlap (0): (none)
ft but not glove (9): commented, compressed, concentrated, confounded, considered, interspersed, pressed, resenting, strained
glove but not ft (35): <unk>, avoid, bang, blankets, boding, capacities, condemning, dirt, drainage, elephant, embracing, flush, frontal, further, grows, headlong, idiot, immediate, material, moaned, morituri, mounted, neat, ocean, punishment, quivering, sepulchre, spanners, specks, suavely, tools, uncalculating, unexciting, unpractical, weight

sgns and glove overlap (0): (none)
sgns but not glove (0): (none)
glove but not sgns (35): <unk>, avoid, bang, blankets, boding, capacities, condemning, dirt, drainage, elephant, embracing, flush, frontal, further, grows, headlong, idiot, immediate, material, moaned, morituri, mounted, neat, ocean, punishment, quivering, sepulchre, spanners, specks, suavely, tools, uncalculating, unexciting, unpractical, weight

sgns and ft overlap (0): (none)
sgns but not ft (0): (none)
ft but not sgns (9): commented, compressed, concentrated, confounded, considered, interspersed, pressed, resenting, strained

ppmi and glove overlap (0): (none)
ppmi but not glove (17): aged, anywhere, brick, buddha, caution, delusion, drollery, frankness, hammer, lugubrious, nobly, possibility, rising, senseless, skirts, towering, trace
glove but not ppmi (35): <unk>, avoid, bang, blankets, boding, capacities, condemning, dirt, drainage, elephant, embracing, flush, frontal, further, grows, headlong, idiot, immediate, material, moaned, morituri, mounted, neat, ocean, punishment, quivering, sepulchre, spanners, specks, suavely, tools, uncalculating, unexciting, unpractical, weight

ppmi and ft overlap (0): (none)
ppmi but not ft (17): aged, anywhere, brick, buddha, caution, delusion, drollery, frankness, hammer, lugubrious, nobly, possibility, rising, senseless, skirts, towering, trace
ft but not ppmi (9): commented, compressed, concentrated, confounded, considered, interspersed, pressed, resenting, strained

ppmi and sgns overlap (0): (none)
ppmi but not sgns (17): aged, anywhere, brick, buddha, caution, delusion, drollery, frankness, hammer, lugubrious, nobly, possibility, rising, senseless, skirts, towering, trace
sgns but not ppmi (0): (none)

Variability by frequency

We can also examine the role of word frequency for hubs. One aspect is to see if different frequency bands have different hubs. In fact, they do, with some hubs overlapping across the different frequency bands. Here's what that looks like for vfair.
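In terms of the earlier definition, this amounts to computing $NN_{k}(x,Y)$ where Y is restricted to one frequency band. A hedged sketch, reusing Hubs.nn_k and the sampler's get_percentile (the real band logic is in Hubs.find_hubs_for_band, used below):

def nn_k_for_band(vecs, sampler, k, words, lo, hi):
    """
    count appearances of words among the k nearest neighbors of the
    words whose frequency percentile lies in the band [lo, hi]
    """
    band = [w for w in vecs.vocab if lo <= sampler.get_percentile(w) <= hi]
    return Hubs.nn_k(vecs.similar_by_word, k, words, band)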

In [15]:
def show_hubs_for_band(sampler,vecs,name,k=1000,threshold=2,step=5,topn=20,show_full=True):
    """
    use each band of width step of the vocabulary as others; use whole vocab as potentilas
    i.e. this looks for hubs that are _for_ the bands
    """
    
    what = Hubs.find_hubs_for_band(sampler,vecs,k=k,thresh=threshold,step=5)
    
    counts = Counter()
    same_bands = 0
    num_hubs = 0
    for i,(m,std,df_) in enumerate(what):
        df = df_[:topn].copy()
        counts.update(list(df['word']))
        df['same band'] = (df['percentile'] >= i) & (df['percentile'] <= i+step)
        same_bands += len(df[df['same band']])
        num_hubs += len(df)
        if show_full: 
            lname = "%s, hubs for percentile range %d to %d" % (name, i, i+step)
            title = "Hubs for %s with k=%d and threshold of %d standard deviations" % (lname,k,threshold)
            show_title(title)
            stitle = "Overall mean: %0.4f &nbsp;&nbsp; Overall std: %0.4f" % (m,std)
            show_title(stitle)
            if len(df[df['word'] == '<unk>']) > 0:
                df = df.replace('<unk>', '&lt;unk&gt;') #rename <unk> so it appears in html (for glove)
            display(HTML(df.to_html(index=False)))

            
    if num_hubs > 0:
        d = [[w,sampler.get_percentile(w),c] for (w,c) in counts.most_common()]
        show_table(d,['Hub','Percentile','Number of bands'], 'Hubs and the number of bands they occurred in for %s' % name)
        
        show_title('Distribution of hubs by percentile for %s' %name)
        pcounts = Counter()
        for (_,p,c) in d:
            pcounts.update({p:c})
        pdata = [pcounts[i] if i in pcounts else 0 for i in range(0,101)]
        
        fig, ax = plt.subplots(figsize=(10, 2))
        
        ax.bar(np.arange(0,101), pdata, color='orange')
        ax.set_xticks(np.arange(0,101,10))
        ax.set_xlabel('percentile')
        ax.set_ylabel('count')
       

        ax.spines["top"].set_visible(False)
        ax.spines["right"].set_visible(False)
        
        plt.show()
    
        show_title('Across the bands, %d of %d (= %0.2f) from among the %d hubs in each were in the target band' % 
               (same_bands, num_hubs, same_bands/num_hubs, topn))
    else:
        show_title('There are no hubs in any of the bands in %s at threshold = %0.2f' % (name, threshold))
    
def compare_hubs_for_band(combo,name,k=1000,threshold=2,step=5,topn=20,show_full=False):
    """
    do show_hubs_for_band for each method in combo
    """
    
    Utils.compare_methods(combo,name,show_hubs_for_band,k=k,threshold=threshold,step=step,topn=topn,show_full=show_full)
In [16]:
name = 'vfair, with sgns'
sampler = vfair_all['sampler']
vecs = vfair_all['sgns']
k = 1000
thresh = 4
step = 5
topn = 10
make_link()
show_hubs_for_band(sampler,vecs,name,k=k,threshold=thresh,step=step,topn=topn, show_full=False)
make_anchor()
Hubs and the number of bands they occurred in for vfair, with sgns
Hub Percentile Number of bands
effect 9 9
generosity 4 8
tended 4 7
unfortunate 5 7
belonged 3 5
permitted 2 5
hearing 7 5
value 4 5
portrait 5 4
absolutely 3 4
surprise 6 4
coolness 3 4
born 8 4
consolation 4 4
abominable 4 4
price 6 3
game 7 3
champion 4 3
parted 5 3
dying 4 3
struck 6 3
intelligence 5 3
noise 4 3
trying 5 2
jim 6 2
alarm 6 2
paying 6 2
communicate 2 2
attentive 3 2
forced 9 2
presents 5 2
listened 6 2
report 3 2
disposed 6 2
personally 4 2
instructed 3 2
touch 4 2
perform 3 1
finished 4 1
charmante 0 1
anger 6 1
anxious 9 1
exactly 3 1
confessed 4 1
quietly 4 1
engage 3 1
remembering 2 1
assumed 3 1
prepared 6 1
civil 4 1
rickety 0 1
altogether 7 1
civilian 7 1
beat 6 1
slept 4 1
parting 8 1
disappointment 4 1
backed 1 1
probably 6 1
hinted 3 1
travel 2 1
questions 4 1
weighed 1 1
airs 7 1
cupid 2 1
consent 3 1
escaped 3 1
x 1 1
fun 4 1
bravely 1 1
beating 5 1
sigh 4 1
therefore 4 1
informed 6 1
treat 3 1
acknowledged 3 1
vauxhall 8 1
parliament 6 1
circumstances 6 1
sold 7 1
liking 3 1
dislike 3 1
strange 5 1
shock 4 1
frank 4 1
occasionally 4 1
act 11 1
entreaties 3 1
warning 4 1
fear 8 1
recovered 3 1
shame 6 1
force 5 1
restored 4 1
bumpers 1 1
dolly 1 1
lest 4 1
managed 3 1
gleams 0 1
quit 5 1
employed 4 1
nevertheless 4 1
immediately 6 1
deserted 2 1
moaning 1 1
event 7 1
Distribution of hubs by percentile for vfair, with sgns
Across the bands, 60 of 200 (= 0.30) from among the 10 hubs in each were in the target band

The obvious thing to notice is that the hubs are mostly relatively low frequency words (10th percentile or lower). As an additional consequence, only about 1/3 of the hubs were in the band being compared.

We can compare sgns with the other methods for vfair.

In [17]:
name = 'vfair'
combo = vfair_all
k = 1000
threshold = 4
step = 5
topn = 10
make_link()
compare_hubs_for_band(combo,name,k=k,threshold=threshold,step=step,topn=topn, show_full=False)
make_anchor()
Hubs and the number of bands they occurred in for vfair, using sgns
Hub Percentile Number of bands
effect 9 9
generosity 4 8
tended 4 7
unfortunate 5 7
belonged 3 5
permitted 2 5
hearing 7 5
value 4 5
portrait 5 4
absolutely 3 4
surprise 6 4
coolness 3 4
born 8 4
consolation 4 4
abominable 4 4
price 6 3
game 7 3
champion 4 3
parted 5 3
dying 4 3
struck 6 3
intelligence 5 3
noise 4 3
trying 5 2
jim 6 2
alarm 6 2
paying 6 2
communicate 2 2
attentive 3 2
forced 9 2
presents 5 2
listened 6 2
report 3 2
disposed 6 2
personally 4 2
instructed 3 2
touch 4 2
perform 3 1
finished 4 1
charmante 0 1
anger 6 1
anxious 9 1
exactly 3 1
confessed 4 1
quietly 4 1
engage 3 1
remembering 2 1
assumed 3 1
prepared 6 1
civil 4 1
rickety 0 1
altogether 7 1
civilian 7 1
beat 6 1
slept 4 1
parting 8 1
disappointment 4 1
backed 1 1
probably 6 1
hinted 3 1
travel 2 1
questions 4 1
weighed 1 1
airs 7 1
cupid 2 1
consent 3 1
escaped 3 1
x 1 1
fun 4 1
bravely 1 1
beating 5 1
sigh 4 1
therefore 4 1
informed 6 1
treat 3 1
acknowledged 3 1
vauxhall 8 1
parliament 6 1
circumstances 6 1
sold 7 1
liking 3 1
dislike 3 1
strange 5 1
shock 4 1
frank 4 1
occasionally 4 1
act 11 1
entreaties 3 1
warning 4 1
fear 8 1
recovered 3 1
shame 6 1
force 5 1
restored 4 1
bumpers 1 1
dolly 1 1
lest 4 1
managed 3 1
gleams 0 1
quit 5 1
employed 4 1
nevertheless 4 1
immediately 6 1
deserted 2 1
moaning 1 1
event 7 1
Distribution of hubs by percentile for vfair, using sgns
Across the bands, 60 of 200 (= 0.30) from among the 10 hubs in each were in the target band

Hubs and the number of bands they occurred in for vfair, using ft
Hub Percentile Number of bands
pluck 2 5
guilbert 0 4
cruel 9 4
obey 1 3
beggary 0 3
heel 0 3
debt 5 3
deum 0 3
method 0 3
kneel 0 3
sindbad 0 3
trustee 0 2
frighten 3 2
delay 2 2
geliebt 0 2
ferry 0 2
purpose 4 2
justify 0 2
exit 0 2
temporise 0 2
mein 0 2
vit 0 2
mammas 0 2
depict 1 2
rebuke 1 2
donor 0 2
reclaim 0 2
judge 3 2
esprit 0 2
morose 0 2
decease 0 2
nincompoop 0 2
joy 4 2
precipitancy 0 2
add 2 2
mauvaise 0 1
defy 2 1
abominably 1 1
deservedly 0 1
yesterday 9 1
calmly 1 1
budgebudge 0 1
explain 1 1
regret 3 1
purveyor 0 1
hopeful 0 1
newcomer 0 1
rebuild 0 1
needlework 0 1
exact 0 1
regency 0 1
split 0 1
methodist 0 1
unwieldily 0 1
disinherit 0 1
obstinately 0 1
reel 0 1
recreant 0 1
promptly 0 1
absolutely 3 1
minor 3 1
oftener 1 1
tunic 0 1
mope 0 1
deprecate 0 1
morsel 0 1
suburbs 0 1
sophy 0 1
myth 0 1
telegraphic 0 1
whatdyecallum 0 1
defunct 1 1
mad 3 1
refrain 0 1
culprit 0 1
telegraph 0 1
mimicry 0 1
defrays 0 1
supreme 1 1
recoil 0 1
prague 1 1
reglar 0 1
luck 7 1
begun 3 1
faugh 0 1
pye 0 1
heavens 4 1
melody 0 1
incompetency 0 1
maxim 0 1
wolsey 0 1
cruelly 1 1
mistrust 0 1
rejoin 0 1
preliminary 0 1
albeit 0 1
dieu 1 1
insular 0 1
rhapsody 0 1
undexterously 0 1
finish 3 1
vouchsafe 0 1
economist 0 1
surreptitiously 1 1
defend 3 1
donkey 0 1
madly 1 1
cruelty 2 1
murderer 0 1
forgot 5 1
enjoy 3 1
alas 3 1
surmise 0 1
begone 0 1
careful 0 1
apologise 0 1
mayor 0 1
speedy 2 1
git 1 1
forgery 0 1
devereux 0 1
unluckily 0 1
judah 0 1
whatdyecallem 0 1
munoz 0 1
observer 1 1
unfortunate 5 1
g 4 1
dee 0 1
justice 6 1
accept 5 1
afford 2 1
forbid 1 1
energetic 0 1
m 4 1
hardy 0 1
enthusiasm 3 1
spaniel 1 1
impulse 1 1
sometime 0 1
dye 0 1
sorry 4 1
circuit 0 1
luckily 1 1
don 2 1
defray 0 1
moin 0 1
asleep 5 1
Distribution of hubs by percentile for vfair, using ft
Across the bands, 19 of 198 (= 0.10) from among the 10 hubs in each were in the target band

Hubs and the number of bands they occurred in for vfair, using glove
Hub Percentile Number of bands
kartoffeln 0 13
toute 0 11
and 99 10
the 100 9
of 99 8
it 94 5
to 99 5
amelia 81 5
when 91 5
her 98 5
braten 0 5
crawley 89 5
but 91 5
thought 69 4
he 97 4
teething 0 3
's 96 3
a 99 3
if 81 3
this 89 3
him 93 3
aussi 0 2
were 89 2
one 84 2
so 88 2
i 95 2
said 92 2
have 91 2
such 74 2
osborne 81 2
all 90 2
was 98 2
in 98 2
miss 89 2
as 95 2
little 90 2
people 55 1
his 97 1
with 96 1
contrasts 0 1
velvets 0 1
for 94 1
pitt 76 1
about 82 1
window 17 1
there 86 1
they 87 1
major 70 1
marine 0 1
which 94 1
briggs 56 1
after 77 1
made 75 1
at 95 1
husband 53 1
mr 85 1
alge 0 1
by 91 1
love 51 1
had 96 1
you 94 1
she 96 1
day 73 1
some 75 1
be 92 1
bobbins 0 1
we 80 1
put 50 1
not 93 1
before 73 1
midst 7 1
my 88 1
sneaking 0 1
westwards 0 1
now 68 1
privateer 0 1
cannibals 0 1
over 76 1
knew 43 1
stupidest 0 1
here 56 1
who 92 1
boy 61 1
0 1
that 97 1
whom 64 1
would 87 1
Distribution of hubs by percentile for vfair, using glove
Across the bands, 10 of 198 (= 0.05) from among the 10 hubs in each were in the target band

Hubs and the number of bands they occurred in for vfair, using ppmi
Hub Percentile Number of bands
though 54 8
told 46 7
always 59 6
course 43 6
asked 44 5
present 35 5
this 89 4
now 68 4
being 49 4
once 51 4
everything 30 4
wife 61 3
for 94 3
were 89 3
rebecca 78 3
only 71 3
known 22 3
father 63 3
be 92 2
very 86 2
it 94 2
indeed 49 2
not 93 2
becky 72 2
ordered 15 2
husband 53 2
used 47 2
come 70 2
man 77 2
rawdon 79 2
when 91 2
before 73 2
dobbin 79 2
emmy 49 2
him 93 2
boy 61 2
who 92 2
own 71 2
little 90 2
would 87 2
them 80 2
too 69 1
brother 48 1
with 96 1
adjustment 0 1
nobody 16 1
mustering 0 1
should 65 1
coming 25 1
about 82 1
money 59 1
if 81 1
they 87 1
the 100 1
how 82 1
flaring 0 1
major 70 1
after 77 1
pretty 39 1
made 75 1
point 14 1
mrs 86 1
there 86 1
briggs 56 1
name 31 1
and 99 1
or 88 1
such 74 1
alone 23 1
do 68 1
woman 61 1
first 55 1
jackals 0 1
instituted 0 1
george 83 1
then 61 1
take 58 1
near 14 1
comfort 14 1
well 62 1
pleasure 28 1
legion 0 1
was 98 1
look 51 1
palatinate 0 1
saying 12 1
outsides 0 1
must 64 1
quite 57 1
a 99 1
thought 69 1
school 29 1
again 43 1
been 84 1
duty 23 1
everybody 35 1
had 96 1
much 74 1
came 72 1
more 76 1
liked 20 1
as 95 1
back 60 1
lady 85 1
on 93 1
her 98 1
exploding 0 1
reason 9 1
said 92 1
enhanced 0 1
gave 51 1
other 71 1
an 85 1
time 65 1
story 23 1
way 55 1
never 75 1
Distribution of hubs by percentile for vfair, using ppmi
Across the bands, 10 of 200 (= 0.05) from among the 10 hubs in each were in the target band

There are a couple of striking differences across the methods. The first is that sgns and ft show similar patterns, with hubs being primarily relatively low frequency words. glove and ppmi both have spikes at the very lowest frequencies, but the rest of the glove hubs are spread out among the higher frequency words, while the ppmi hubs occur more or less across the whole spectrum.

The other difference is the proportion of hubs occurring in the band being compared: 0.30 for sgns, 0.10 for ft, and down to 0.05 for both glove and ppmi.

We can check to see if the same patterns hold for heartd. They don't — rather there's a massive breakdown, except for ppmi. There are no hubs for sgns and ft, and only 1 for glove.

In [18]:
name = 'heartd'
combo = heartd_all
k = 1000
threshold = 3
step = 5
topn = 10
make_link()
compare_hubs_for_band(combo,name,k=k,threshold=threshold,step=step,topn=topn, show_full=False)
make_anchor()
There are no hubs in any of the bands in heartd, using sgns at threshold = 3.00

There are no hubs in any of the bands in heartd, using ft at threshold = 3.00

Hubs and the number of bands they occurred in for heartd, using glove
Hub Percentile Number of bands
shrugs 0 2
Distribution of hubs by percentile for heartd, using glove
Across the bands, 0 of 2 (= 0.00) from among the 10 hubs in each were in the target band

Hubs and the number of bands they occurred in for heartd, using ppmi
Hub Percentile Number of bands
all 78 5
that 90 5
had 92 5
to 95 5
in 93 5
we 72 4
it 91 4
well 41 4
my 82 4
of 98 4
keep 17 4
only 43 3
not 82 3
i 98 3
and 96 3
he 92 3
but 80 3
perhaps 23 3
a 97 2
which 48 2
thing 28 2
me 83 2
the 99 2
him 81 2
made 49 2
at 84 2
knew 17 2
as 86 2
man 64 2
you 87 2
seemed 50 2
would 71 2
best 8 1
shutter 10 1
clear 12 1
into 55 1
his 88 1
very 69 1
out 73 1
by 72 1
were 74 1
who 45 1
there 79 1
came 48 1
river 45 1
their 57 1
one 72 1
lugubrious 0 1
could 62 1
lost 21 1
with 89 1
was 94 1
wanted 21 1
after 42 1
enough 27 1
mr 41 1
they 76 1
's 62 1
said 70 1
oh 22 1
like 65 1
am 29 1
last 37 1
upon 44 1
say 39 1
this 77 1
being 28 1
know 58 1
pilgrims 25 1
first 31 1
true 11 1
nobly 0 1
manager 37 1
looking 21 1
did 58 1
will 30 1
getting 7 1
think 31 1
better 9 1
some 62 1
saw 38 1
brick 0 1
Distribution of hubs by percentile for heartd, using ppmi
Across the bands, 4 of 148 (= 0.03) from among the 10 hubs in each were in the target band

Discussion

It seems like there should be a connection between the frequency effects with hubs and other frequency effects we've seen. However, the connection isn't straightforward, especially for heartd.

We saw in the stratification post that:

  • for sgns and ft, frequency is inversely related to similarity
  • for glove and ppmi frequency is directly related to similarity

Starting with sgns and ft, two low frequency words tend to be more similar to each other than two high frequency words are. Since there are lots of low frequency words, that might explain why the hubs fall among the lower frequency words. Glove works the opposite way: two high frequency words tend to be more similar than two low frequency words, and we get high frequency hubs. The spike in hubs at the lowest frequencies corresponds to the anomalous cell in the stratification, where the lowest frequency words are more similar to each other than even slightly more frequent words are.

Nice so far. Unfortunately, ppmi doesn't follow the pattern. It shows similar stratification to glove, but its hubs, as we saw, are across the board.

Heartd also only partially follows the pattern. sgns and ft are both extremely stratified, which might explain why there are no hubs: everything is close to everything else. However, glove and ppmi are more like vfair for stratification, but glove has only 1 hub in heartd, while ppmi has many.

Thus, even though the stratification of similarities may be relevant for understanding hubs, it is clearly not sufficient.

Back to the introduction

References

[1] Nenad Tomašev, Miloš Radovanović, Dunja Mladenić, and Mirjana Ivanović. 2011. A probabilistic approach to nearest-neighbor classification: Naive hubness Bayesian kNN. In Proceedings of the International Conference on Information and Knowledge Management (CIKM 2011), pp. 2173–2176.

[2] Johannes Hellrich and Udo Hahn. 2016. Bad Company—Neighborhoods in Neural Embedding Spaces Considered Harmful. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 2785–2796, Osaka, Japan, December 11-17 2016.

[3] Maria Antoniak and David Mimno. 2018. Evaluating the Stability of Embedding-based Word Similarities. Transactions of the Association for Computational Linguistics, vol. 6, pp. 107–119.
