---
title: "Word vectors with small corpora:<br>Exploring similarities"
output: html_notebook
fig_caption: yes
---

## © 2018 Chris Culy, April 2018
### [chrisculy.net](http://chrisculy.net/)


## Overview
This is one of a [series of posts](wvecs_intro.html) on using word vectors with small corpora. In this post I discuss ways to explore word similarities that don't compare the model results with precompiled human judgments. In particular, I show how we can use "the closests of the closests" as well as the slope of similar items to get different perspectives on word similarities, both within a corpus and across corpora.


```{r message=FALSE}
library(readr)
library(tidyverse)
library(scales)
library(reticulate)
```

```{r}
prep = FALSE
```

```{r}
# set up python functions
if (prep) {
  use_condaenv("textp") #this must have gensim installed
  
  psys <- import('sys')
  psys$path <- c(psys$path,getwd())
  
  svd_similarity <- import("similarities")
  
  get_most_similar <- function(text,words,n,win,dim,min_count) {
    x <- svd_similarity$get_most_similar(text,words,n=n,win=win,dim=dim,min_count=min_count)
    
    # flatten the list of per-word results into a data frame:
    # one row per (word, parameter setting, neighbouring item)
    nr <- length(x)
    what <- data.frame(matrix(unlist(x), nrow=nr, byrow=TRUE), stringsAsFactors=FALSE)
    names(what) <- c('text','word','win','dim','min_count','item','rank','sim')
    
    return(what)
  }
}

```

```{r message=FALSE}
# calculate/load  data that needs python

if (prep) {
  text <- "vfair"
  n <- 10
  min_count <- NULL
  words <- c("house","horse","awful","life","letters","act","road","listened","pardon","particulars","woke","abominable","doings","alas")
  
  vfair_sims <-
    get_most_similar(text,words,n=n,win=NULL,dim=NULL,min_count=min_count)
  write_tsv(vfair_sims,"sims/vfair_sims.csv")
}
vfair_sims <- read_tsv("sims/vfair_sims.csv")


if (prep) {
  text <- "waywe"
  n <- 10
  min_count <- 1
  words <- c("house","horse")

  waywe_house_horse <-
    get_most_similar(text,words,n=n,win=NULL,dim=NULL,min_count=min_count)
  write_tsv(waywe_house_horse,"sims/waywe_house_horse.csv")
}
#waywe_house_horse <- read_tsv("sims/waywe_house_horse.csv")


if (prep) {
  text <- "moby"
  n <- 10
  min_count <- 1
  words <- c("house","horse")

  moby_house_horse <-
    get_most_similar(text,words,n=n,win=NULL,dim=NULL,min_count=min_count)
  write_tsv(moby_house_horse,"sims/moby_house_horse.csv")
}
#moby_house_horse <- read_tsv("sims/moby_house_horse.csv")


if (prep) {
  text <- "kidnapped"
  n <- 10
  min_count <- 1
  words <- c("house","horse")

  kidnapped_house_horse <-
    get_most_similar(text,words,n=n,win=NULL,dim=NULL,min_count=min_count)
  write_tsv(kidnapped_house_horse,"sims/kidnapped_house_horse.csv")
}
#kidnapped_house_horse <- read_tsv("sims/kidnapped_house_horse.csv")


if (prep) {
  text <- "dracula"
  n <- 10
  words <- c("house","horse","life","awful")

  dracula_sims <-
    get_most_similar(text,words,n=n,win=NULL,dim=NULL,min_count=NULL)
  write_tsv(dracula_sims,"sims/dracula_sims.csv")
}
dracula_sims <- read_tsv("sims/dracula_sims.csv")

if (prep) {
  text <- "jane"
  words <- c("house","horse","life","awful")
  jane_sims <- get_most_similar(text,words,n=n,win=NULL,dim=NULL,min_count=NULL)
  write_tsv(jane_sims, "sims/jane_sims.csv")
}
jane_sims <- read_tsv("sims/jane_sims.csv")

if (prep) {
  text <- "threemen"
  words <- c("house","horse","life","awful")
  threemen_sims <- get_most_similar(text,words,n=n,win=NULL,dim=NULL,min_count=NULL)
  write_tsv(threemen_sims, "sims/threemen_sims.csv")
}
threemen_sims <- read_tsv("sims/threemen_sims.csv")



###############



```


```{r message=FALSE}
#load simple data

cnames <- list('davidc', 'rbadge', 'dracula', 'moby', 'scarlet', 'emma', 'moonstone', 'frankenstein', 'pym', 'sybil', 'heartd', 'grubb', 'threemen', 'jane', 'nabbey', 'vfair', 'dorian', 'waywe', 'kidnapped', 'wuthering')

eval_texts <- list('sign4','jude','midmarch')
test_names <- list("ws353","ws353_similarity","ws353_relatedness","bruni_men")

counts <- Reduce(rbind,lapply(cnames, function(c){
  fname <- paste0("counts/",c,"-counts.csv")
  these_counts <- read_delim(fname, "\t", escape_double = FALSE, trim_ws = TRUE) %>%
    mutate(text=c, rank=row_number()) 
  
  #now add rank percentile [NOT count percentile, which isn't useful]
  len <- nrow(these_counts)
  these_counts %>% mutate(percentile=round(100*(1-rank/len), digits=2))
  
})) %>% select(text, everything())



best_SVD_sim_scores <- read_delim("testsets_parameters/SVD-sim_tests.csv", "\t", escape_double = FALSE, trim_ws = TRUE) %>%
  filter(!(text %in% eval_texts)) %>%
  group_by(testset,text) %>%
  filter(sF1 == max(sF1))

```

## Background
In the previous post, [Finding useful parameters and methods](wvecs_testsets_parameters.nb.html), I discussed finding the values for the parameters window size (_win_), number of dimensions (_dim_), and the minimum frequency of words to be included (_min_count_) that give the highest sF1 scores on four standard testsets of human similarity judgments. However, as mentioned at the end of that post, it is reasonable to question whether those testsets are relevant to evaluating these small corpora (recall that I am using 19th century novels as my corpora). 

One problem with the testsets in this context is that they contain a lot of vocabulary that is not found in any single text. We saw that the best recall was still less than 75%, and with smaller texts it was not uncommon for recall to be less than 30%. The testsets then tell us little about the vocabulary that _is_ in the texts.

A second problem with the testsets is that they reflect 21st century judgments about word usage, and we know that word usage (and meaning) has changed since the 19th century. In fact, there have been recent papers using word vectors to quantify those changes ([[1]](#ref1), [[2]](#ref2), [[3]](#ref3)). In other words, the testsets can only tell us about 21st century interpretations of 19th century usage. While that might be interesting in its own right, it is not a way to discover how 19th century authors used words.

A third issue in using word vectors to explore word similarities is that different parameter settings give different similarities. Here is one example from _Vanity Fair_. I've chosen the word _house_ as the most common noun (_lady_ is more common, but since we lowercased everything in the preprocessing, it can also be a title, as in _lady jane_). I've fixed the window size at 5, and for each dimension in the [parameters experiment](wvecs_testsets_parameters.nb.html), we see the 3 words that the model judges to be the most similar to _house_.

Throughout this post, as in the [parameters experiment](wvecs_testsets_parameters.nb.html), I am using ppmi_svd vectors created using hyperwords [[4]](#ref4), with similarities calculated using gensim [[5]](#ref5).

```{r fig.width=6, fig.height=6}
vfair_house_horse <- vfair_sims %>% filter(min_count==1,word=='house' | word=='horse')
vfair_house_horse %>% 
  filter(word=='house',win==5, rank<4) %>%
  ggplot() + theme_classic() + 
  labs(title="Closest 3 words to 'house' in Vanity Fair, with win=5,min_count=1") +
  scale_y_continuous(limits = c(0,1)) +
  geom_text(aes(dim,sim,label=item, color=factor(rank)), alpha=0.75, size=4, show.legend = FALSE)

```

We can immediately see two issues:

* the most similar words vary from setting to setting
* the similarity (sim) varies from setting to setting

A third issue is that although win=4, dim=400 was the best scoring setting for the testsets, it gives the lowest similarity scores for _house_. In other words, a model that does well on the testsets will not necessarily give the highest similarity scores.

If we add in the closest words to _horse_, we see that similarity scores vary across words, even within the same settings. For example, the closest words to _horse_ with dim=25 have a similarity of a little more than 0.75, while for _house_ the scores are closer to 0.9.

```{r fig.width=6, fig.height=9}
vfair_house_horse %>% 
  filter(win==5, rank<4) %>%
  ggplot() + theme_classic() + 
  labs(title="Closest 3 words to 'house' and 'horse' in Vanity Fair\nwin=5, min_count=1") +
  scale_y_continuous(limits = c(0,1)) +
  scale_x_continuous(limits = c(0,450)) +
  geom_text(aes(dim,sim,label=item, color=factor(rank)), size=4, alpha=0.75, show.legend = FALSE) + 
  facet_wrap(~word, ncol=1)


```

Finally, when we look at a different text, _Dracula_, we see different words (not surprisingly), and different scores. There is also a slight difference in the _trends_ of the scores across the dimensions.

```{r fig.width=6, fig.height=9}
dracula_house_horse <- dracula_sims %>% filter(min_count==1,word=='house' | word=='horse')

dracula_house_horse %>% 
  filter(win==5, rank<4) %>%
  ggplot() + theme_classic() + 
  labs(title="Closest 3 words to 'house' and 'horse' in Dracula\nwin=5, min_count=1") +
  scale_y_continuous(limits = c(0,1)) +
  scale_x_continuous(limits = c(0,450)) +
  geom_text(aes(dim,sim,label=item, color=factor(rank)), size=4, show.legend = FALSE) + 
  facet_wrap(~word, ncol=1)
```


```{r fig.width=6, fig.height=8}
rbind(vfair_house_horse,dracula_house_horse) %>%
  filter(win==5, rank==1) %>%
  ggplot() + theme_classic() + 
  labs(title="Trends for closest word to house, horse\nin Vanity Fair, Dracula, win=5, min_count=1") +
  scale_y_continuous(limits = c(0,1)) +
  geom_line(aes(dim,sim,color=text)) + 
  geom_point(aes(dim,sim,color=text, shape=text), size=2) +
  facet_wrap(~word, ncol=1)
```
There are thus three fundamental issues:

* How to deal with variation of a single word across multiple models
* How to deal with variation of multiple words within a single model
* How to deal with variation across corpora

I will address each of these in turn.

## Exploring variation across models

Here is a selection of words in _Vanity Fair_ of different frequencies, chosen from the top 85 to 99 percentiles by rank. I'll draw from these words in the examples that follow.

```{r}
counts %>% filter(text=='vfair', percentile %in% seq(from=85,to=99,by=2)) %>% arrange(rank)
```

### Closest of the closest

Given a word, e.g. _life_, one thing we can do is look at the closest words to it across all the parameter settings, and find which of those has the highest similarity -- the closest of the closest. 

As an aside, we can note that for this sample, dim is always 25 (the smallest number of dimensions tested), min_count is usually, but not always, 1, and win varies. This is in contrast to the results in the [testset evaluations](wvecs_testsets_parameters.nb.html), where min_count=1 always gave the best sF1 scores across testsets, while dim and win varied.

```{r}
vfair_sims_reduced <- vfair_sims %>% 
  filter(word !=  'house' & word != 'horse') %>%
  left_join(counts, by=c('text','word'='item')) %>%
  rename(text_rank=rank.y,sim_rank=rank.x)


vfair_sims_reduced %>% 
  group_by(word) %>%
  filter(sim==max(sim)) %>%
  select(word,percentile,item,sim,win,dim,min_count) %>%
  arrange(-sim, word, item)

```

The similarities range from just over 0.80 to a little over 0.93, and there seems to be no strong relation between the rank percentile and the similarity of the closest of the closest.

```{r fig.height=7}
vfair_sims_reduced %>% 
  group_by(word) %>%
  filter(sim==max(sim)) %>%
  ggplot() + theme_classic() +
  labs(title="Closest of the closest for selected words in Vanity Fair") +
  scale_x_continuous(limits = c(84,100)) +
  geom_text(aes(percentile,sim, label=paste0(word,":",item))) +
  geom_smooth(aes(percentile,sim), method='loess', color="orange")
```

If we want to explore a particular word with respect to other words, we can choose the model with the parameter settings that gave the closest of the closest, and proceed from there.
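
For example, here is a minimal sketch of that idea for _life_ (the helper name `best_life_model` is just for this sketch): pull out the parameter settings behind the closest of the closest, then list everything that single model returned.

```{r}
# sketch: the parameter settings that produced the closest of the closest for 'life'
best_life_model <- vfair_sims_reduced %>%
  filter(word == 'life') %>%
  filter(sim == max(sim)) %>%
  select(win, dim, min_count)

# all of that single model's nearest neighbours for 'life'
vfair_sims_reduced %>%
  inner_join(best_life_model, by = c('win','dim','min_count')) %>%
  filter(word == 'life') %>%
  arrange(sim_rank) %>%
  select(item, sim_rank, sim)
```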

### Slope

Let's take a look at the 5 words closest to _life_ across parameters, by their similarities. I'll limit the similarities to > 0.66 for clarity.

```{r fig.height=12, fig.width=8}
vfair_sims_reduced %>% 
  filter(word=='life', sim_rank<6, sim>0.66) %>%
  ggplot() + theme_classic() +
  labs(title="5 closest words to 'life' in Vanity Fair (sim>0.66), by win and dim") +
  scale_x_continuous(limits = c(0.5,5.5)) +
  geom_line(aes(sim_rank,sim, color=factor(min_count))) +
  geom_text(aes(sim_rank,sim,label=item, color=factor(min_count)), show.legend = FALSE) +
  facet_wrap(win ~ dim, ncol = 2)
```

We see again here that the similarity scores vary quite a bit. We can also see that the _range_ of similarity scores from the 1st to the 5th closest word varies across parameter settings. While there are different ways we might use this information, one way is to look at the _slope_ of the line going through the first and last words. By using the slope instead of the range, we abstract away from the absolute similarity values.
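
As a toy illustration with made-up similarity values, the slope is just the drop from the first to the last similarity, divided by the number of steps between them.

```{r}
# toy slope calculation: made-up similarities for the 5 closest items, in rank order
toy_sims <- c(0.85, 0.80, 0.78, 0.74, 0.72)
(max(toy_sims) - min(toy_sims)) / (length(toy_sims) - 1)  # 0.0325
```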

To find potentially interesting closest items, we can look for the model that has the steepest slope and the model that has the shallowest. (There are other things we could do as well, but this is a start.)


```{r}
show_min_max_slopes <- function(info,wd,nclosest=5,tname=NULL) {
  dx = nclosest-1
  slopes <- info %>% 
    filter(word == wd, sim_rank <= nclosest) %>%
    group_by(text,win,dim,min_count) %>%
    mutate(slope=(max(sim)-min(sim))/dx) %>%
    ungroup()
  
  slopes_minmax <- slopes %>% 
    group_by(text) %>%
    filter(slope == max(slope) | slope == min(slope))
  
  if (is.null(tname)) {
    tname <- info$text[[1]]
  }
  
  t <- paste0("Steepest and shallowest slopes\nfor the ", 
             nclosest, ' words closest to "', wd, '" in ', tname)
  
  g <- slopes_minmax %>%
    ggplot() + theme_classic() + labs(title=t) + 
    scale_x_continuous(limits = c(0.5,nclosest+1.5), breaks=seq(1,nclosest)) +
    geom_line(aes(sim_rank,sim, color=factor(slope), alpha=0.75), show.legend = FALSE) +
    geom_text(aes(sim_rank,sim,label=item, color=factor(slope)), show.legend = FALSE) +
    geom_text(data=filter(slopes_minmax,sim_rank==nclosest),
              aes(nclosest+1,sim,
                  label=sprintf('(%0.5f)',slope), 
                  color=factor(slope)),
              show.legend = FALSE)
  g
}

```


```{r}
show_min_max_slopes(vfair_sims_reduced,'life',nclosest = 5, tname="Vanity Fair")
```

Of course, there is nothing special about choosing the 5 closest items. We might choose just 2, or maybe 10.

```{r}
show_min_max_slopes(vfair_sims_reduced,'life',nclosest = 2, tname="Vanity Fair")
```

```{r fig.height=6, fig.width=10}
show_min_max_slopes(vfair_sims_reduced,'life',nclosest = 10, tname="Vanity Fair")
```

As we can see, the closest words vary dramatically not only by the slope, but by how many we choose to focus on. The reason for the differences across the number of words chosen has two aspects. The first aspect is that different parameter settings give different similarities, as we've seen numerous times, and of course that's how we can have a difference between the steepest and shallowest slope in the first place.

The second aspect has to do with the behavior of the similarity lines: they are not straight lines, but a series of segments, each of which may have a different slope. The slope of the whole series is the slope of a line from the first item to the last item, without taking the internal variation into account. So a series with a steep slope over the first few items might level out and have a shallower overall slope; conversely, a series that starts out with a shallow slope may decline sharply and end up with a steeper slope overall.
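
As a toy numeric illustration (made-up values): a series that drops fast and then flattens can be steeper over 5 items but shallower over 10 items than a series that drops steadily.

```{r}
# made-up similarity series: 'a' drops fast early then flattens, 'b' drops steadily
a <- c(0.90, 0.82, 0.80, 0.79, 0.78, 0.78, 0.77, 0.77, 0.76, 0.76)
b <- c(0.90, 0.88, 0.86, 0.84, 0.82, 0.80, 0.78, 0.76, 0.74, 0.72)

overall_slope <- function(s, n) (max(s[1:n]) - min(s[1:n])) / (n - 1)

c(a=overall_slope(a, 5),  b=overall_slope(b, 5))   # over 5 items, 'a' is steeper
c(a=overall_slope(a, 10), b=overall_slope(b, 10))  # over 10 items, 'b' is steeper
```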

We can see the same thing with the real data in the following two charts. In each, I have used the subset of 9 parameter settings for which _first_ ranks as the most similar word to _life_. The first chart shows the similarity lines and their slope over 5 words; the second chart shows them over 10 words.

Over 5 words, the sequence starting <_first_, _station_, _squeezed_> has the steepest slope (0.02191), but over 10 words the corresponding sequence has a slope of (0.01098), which is _not_ the steepest slope. Rather, the sequence starting with <_first_, _part_, _every_> has the steepest slope over 10 words (0.01601), although its slope over 5 words was not the steepest, at (0.01692). 

Although the sequence that has the shallowest slope is the same one over 5 and 10 words, starting with <_first_, _days_, _delirium_>, this is a coincidence: it doesn't have the shallowest slope over 3 words (not shown here).

```{r message=FALSE}

vfair_life_first <- vfair_sims_reduced %>% 
  filter(word=='life', (item=='first' & sim_rank==1)) %>%
  select(word,win,dim,min_count) %>% unique() %>%
  left_join(vfair_sims_reduced)

compare_life_first <- function(nclosest) {
  dx = nclosest-1
  
  slopes <- vfair_life_first %>%
    filter(sim_rank <= nclosest) %>%
    group_by(win,dim,min_count) %>%
    mutate(slope=(max(sim)-min(sim))/dx) %>%
    ungroup()
  
  t <- paste("Comparison of slopes for 'life' with 'first' as closest,\nnclosest = ",nclosest)
 
  g <- slopes %>%
    ggplot() + theme_classic() + labs(title=t) + 
    scale_x_continuous(limits = c(0.5,nclosest+1.5), breaks=seq(1,nclosest)) +
    geom_line(aes(sim_rank,sim, color=factor(slope), alpha=0.75), show.legend = FALSE) +
    geom_text(aes(sim_rank,sim,label=item, color=factor(slope)), show.legend = FALSE) +
    geom_text(data=filter(slopes,sim_rank==nclosest),
              aes(nclosest+1,sim,
                  label=sprintf('(%0.5f)',slope), 
                  color=factor(slope)),
              show.legend = FALSE)
  g
}
```

```{r fig.height=6, fig.width=6}
compare_life_first(5) 
```

```{r fig.height=6, fig.width=9}
compare_life_first(10) 
```

## Exploring variation within a single model

We can also explore multiple words within a single model using the notion of slope. For example, we might choose a large-ish window, since that should give more "semantic" results, according to the literature. We'll pick a window of 10, and arbitrarily fix dim=400 and min_count=1.

```{r}
show_model_slopes <- function(info,wds,nclosest = 5,tname=NULL) {
  dx <- nclosest-1
  slopes <- info %>% 
      filter(word %in% wds, sim_rank <= nclosest) %>%
      group_by(word,win,dim,min_count) %>%
      mutate(slope=(max(sim)-min(sim))/dx) %>%
      ungroup()
  
  if (is.null(tname)) {
    tname <- info$text[[1]]
  }
  
  wds_str <- paste(wds, collapse = ", ")
  t <- paste0("Slopes for the ", nclosest, ' words closest to\n', wds_str, '\nin ', tname)
  
  
  
  g <- slopes %>%
    ggplot() + theme_classic() + labs(title=t) + 
    scale_x_continuous(limits = c(0.5,nclosest+2.0), breaks=seq(1,nclosest)) +
    geom_line(aes(sim_rank,sim, color=word, alpha=0.75), show.legend = FALSE) +
    geom_text(aes(sim_rank,sim,label=item, color=word), show.legend = FALSE) +
    geom_text(data=filter(slopes,sim_rank==nclosest),
              aes(nclosest+1,sim,
                  label=sprintf('(%s: %0.5f)',word,slope), 
                  color=word),
              show.legend = FALSE)
  
  
  g

}

```

```{r fig.width=8, fig.height=10}

nclosest <- 4
w <- 10
d <- 400
mc <- 1
wds <- unique(vfair_sims_reduced$word)

info <- vfair_sims_reduced %>% 
  filter(win==w,dim==d,min_count==mc)

show_model_slopes(info, nclosest = nclosest, wds = wds, tname = paste0("Vanity Fair, with win=",w," dim=",d,", and min_count=",mc))
```

We might also try parameters that according to the literature should give more syntax-oriented results, namely with a small window. We'll pick a window of 2, and arbitrarily fix dim=100 and min_count=1.

```{r fig.width=8, fig.height=10}

nclosest <- 4
w <- 2
d <- 100
mc <- 1
wds <- unique(vfair_sims_reduced$word)

info <- vfair_sims_reduced %>% 
  filter(win==w,dim==d,min_count==mc)

show_model_slopes(info, nclosest = nclosest, wds = wds, tname = paste0("Vanity Fair, with win=",w," dim=",d,", and min_count=",mc))
```

## Exploring variation across corpora

### The test data: _awful_
We can use _awful_, a word used in papers on word vectors and semantic change, as an example for comparing words across corpora. We'll use four texts, two from the mid 1800s and two from the late 1800s. This fits roughly with one of the timespans showing change in [[2]](#ref2). (Interestingly, _Emma_, which is even earlier (1815), does not contain the word _awful_.) We expect that the two earlier books would use _awful_ differently from the two later books.

* Jane Eyre: 1847
* Vanity Fair: 1847
* Three Men in a Boat: 1889
* Dracula: 1897


```{r}
#using vfair,jane,dracula,threemen
awful_sims <- rbind(vfair_sims,dracula_sims,jane_sims,threemen_sims) %>%
  filter(word=="awful") %>%
  left_join(counts, by=c('text','word'='item')) %>%
  rename(text_rank=rank.y,sim_rank=rank.x)
```

### The closests of the closests
Here we see that the closests of the closests vary across the texts. 
```{r}
awful_sims %>% 
  group_by(text) %>%
  filter(sim==max(sim)) %>%
  arrange(-sim)

```

We can tally the number of models for which an item is the closest to _awful_. For brevity, I've limited this to items that occur as the closest item in at least 5 models. There are certainly things that deserve a closer look, like _an_ showing up in 5 different models for _Three Men in a Boat_. However, we'll keep moving.

```{r}
awful_sims %>% 
  filter(sim_rank==1) %>%
  group_by(text,item) %>%
  summarize(n=n()) %>%
  filter(n>=5) %>%
  arrange(text,-n,item)
```

Are there any overlaps in the closest of the closest across texts? Just 4, which isn't very many, and they all occur in _Vanity Fair_.
```{r}
awful_sims %>% 
  filter(sim_rank==1) %>%
  group_by(item,text) %>%
  summarize(n=n()) %>%
  group_by(item) %>%
  summarize(ntexts=n()) %>%
  filter(ntexts>1) %>%
  inner_join(filter(awful_sims, sim_rank==1), by=c("item")) %>%
  select(item,text) %>% 
  unique() %>%
  arrange(item,text)

```

We can expand to, for example, the 5 closest.
```{r fig.height=6, fig.width=4}
awful5 <- awful_sims %>% 
  filter(sim_rank<=5) %>%
  group_by(item,text) %>%
  summarize(n=n()) %>%
  group_by(item) %>%
  summarize(ntexts=n()) %>%
  filter(ntexts>1) %>%
  inner_join(awful_sims, by=c("item")) %>%
  select(item,text) %>% 
  unique() 

t <- sprintf("%d items are in the closest 5 to 'awful'\nin more than one text", length(unique(awful5$item)))

awful5 %>%
  ggplot() + theme_classic() + labs(title=t) +
  scale_y_discrete(limits = sort(unique(awful5$item), decreasing = TRUE)) +
  geom_point(aes(text,item,color=text), show.legend = FALSE)


```

We can compare these texts using the settings from one of the studies in the literature. In [[2]](#ref2) they use win=4 and dim=300. We can approximate those with win=5 and dim=200 (or dim=400). They use different min_count values for different corpora, but well above ours: ours are 1, 3, 5, 10, 20 while theirs are 100 and 500.

There is very little overlap across the texts (with similar results for dim=400 and even increasing the n-closest to 10).

```{r fig.height=10, fig.width=4}
nclosest = 5
d = 200

awful_approx <- awful_sims %>%
  filter(win==5,dim==d,min_count<5,sim_rank<=nclosest)
  
t = sprintf("Closest %d items with dim=%d,\nand min_count is 1 or 3", nclosest, d)

awful_approx %>%  
  ggplot() + theme_classic() + labs(title=t) +
  scale_y_discrete(limits = sort(unique(awful_approx$item), decreasing = TRUE)) +
  geom_point(aes(text,item,color=text), show.legend = FALSE)
```

### Similarity lines
Since there is little overlap among the closest words across texts, we might use the slope technique to get a different idea of what's going on.

```{r fig.height=6, fig.width=8}

show_awful_slopes <- function(info,nclosest = 5,tname=NULL) {
  dx <- nclosest-1
  slopes <- info %>% 
      filter(sim_rank <= nclosest) %>%
      group_by(text) %>%
      mutate(slope=(max(sim)-min(sim))/dx) %>%
      ungroup()
  
  if (is.null(tname)) {
    tname <- ""
  }
  
  t <- paste0("Slopes for the ", nclosest, ' words closest to awful ',tname)
  
  g <- slopes %>%
    ggplot() + theme_classic() + labs(title=t) + 
    scale_x_continuous(limits = c(0.5,nclosest+2.0), breaks=seq(1,nclosest)) +
    geom_line(aes(sim_rank,sim, color=text, alpha=0.75), show.legend = FALSE) +
    geom_text(aes(sim_rank,sim, label=item, color=text), show.legend = FALSE) +
    geom_text(data=filter(slopes,sim_rank==nclosest),
              aes(nclosest+1,sim,
                  label=sprintf('(%s: %0.5f)',text,slope), 
                  color=text),
              show.legend = FALSE)
  g
}

nclosest = 5
d = 200

awful_approx <- awful_sims %>%
  filter(win==5,dim == d, min_count==1) #,sim_rank<=nclosest)
  
t = sprintf("with win=5, dim=%d, and min_count=1\n", d)

show_awful_slopes(awful_approx, nclosest = nclosest, tname = t)

```

```{r fig.height=6, fig.width=8}
d = 400

awful_approx <- awful_sims %>%
  filter(win==5,dim == d, min_count==1)
  
t = sprintf("with win=5, dim=%d, and min_count=1\n", d)

show_awful_slopes(awful_approx, nclosest = nclosest, tname = t)

```

Not surprisingly, the closest words are different when we change the number of dimensions, though there is overlap, and in the case of _threemen_, the overlap (and the ordering) is exact.
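
A quick sketch of that overlap: for each text, count how many of the 5 closest items to _awful_ appear under both dim=200 and dim=400 (win=5, min_count=1).

```{r}
# sketch: per text, how many of the 5 closest items to 'awful'
# show up in both the dim=200 and the dim=400 model (win=5, min_count=1)
awful_sims %>%
  filter(win==5, min_count==1, dim %in% c(200,400), sim_rank<=5) %>%
  group_by(text, item) %>%
  summarize(ndims=n_distinct(dim)) %>%
  group_by(text) %>%
  summarize(shared=sum(ndims==2))
```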

As before, we can also find the steepest and shallowest slopes for _awful_ for each text.

```{r fig.height=13, fig.width=8}
awful_sims %>%
  show_min_max_slopes(wd = 'awful',nclosest = 5, tname = "the 4 texts") +
  facet_wrap(~text, ncol=1)
```


### On alignment

In work using word vectors to explore meaning shift across time, word vectors from different time slices are _aligned_ to a common coordinate system ([[1]](#ref1), [[2]](#ref2), [[3]](#ref3)). However, there are two significant issues with trying to align these small corpora, namely the large non-overlap of vocabulary across the texts and the distortion of distances.

In order to align two different vector space models, only their common vocabulary can be used (possibly even just a subset of it). However, in the case of these small corpora, not only is their overlap relatively limited, but we cannot be sure, a priori, that the non-shared words are not important for the vectors of the word(s) we are interested in. Therefore, alignment is not an appropriate technique for comparing small corpora. For large corpora, the _assumption_ is that all the important words will be in the shared vocabulary, and that is probably a reasonable assumption. 

#### Shared (total) vocabulary across texts
For example, ~49% of **vfair**'s vocabulary is shared with **jane**, while ~62% of **jane**'s vocabulary is shared with **vfair**.

```{r}
txts = c("dracula","jane","threemen","vfair")

# entry [t2, t1] is the fraction of t2's vocabulary that also occurs in t1
# (rows: the text whose vocabulary is measured; columns: the text it is compared against)
shared_vocab <- sapply(txts, function(t1) {
  t1_wds <- filter(counts,text==t1)
  
  x <- sapply(txts,function(t2) {
    t2_wds <- filter(counts,text==t2)
    
    info <- t2_wds %>% 
      filter(item %in% t1_wds$item)
    
    return( nrow(info)/nrow(t2_wds) )
    
    })
  return(x)
})

shared_vocab
```

The second issue has to do with distance distortion. The methods for aligning two different vector models attempt to minimize (certain) differences in distances between word vectors across the original model and the aligned model. While those differences are guaranteed to be as small as possible, they are not necessarily 0. With large corpora, the differences in distances may be negligible (though I am not aware of any discussion of this point), but given the large amount of variation across models with small corpora, it seems likely that differences in distances could be more relevant. The slope-of-similarity-lines method used here avoids the issue entirely, since each model is viewed independently, with no alignment. Distance distortion in a different form is also a theme in the [visualization post](wvecs_visualization.html).
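
For concreteness, here is a minimal sketch (on toy matrices, not the actual models) of one standard alignment method, orthogonal Procrustes, the approach in [[2]](#ref2). The point is simply that the quantity being minimized is generally not zero.

```{r}
# orthogonal Procrustes: find the orthogonal Q minimizing ||A %*% Q - B||_F,
# where the rows of A and B are the vectors of the *shared* vocabulary.
# The residual that remains after alignment is the kind of distortion discussed above.
align_procrustes <- function(A, B) {
  s <- svd(t(A) %*% B)
  A %*% (s$u %*% t(s$v))
}

set.seed(42)
A <- matrix(rnorm(60), nrow=12)  # toy "model 1" vectors for 12 shared words
B <- matrix(rnorm(60), nrow=12)  # toy "model 2" vectors for the same words
sum((align_procrustes(A, B) - B)^2)  # residual after alignment, typically > 0
```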

## Discussion and Conclusion

One important point to make with all this variation is that there is no a priori reason to pick one set of parameter settings over another. Each one gives a different _perspective_ on the corpus/corpora, and it is up to the analyst to delve deeper into these perspectives.

At the same time, the issues and approaches raised here are relevant for large corpora as well, especially for lower frequency words. Studies of particular phenomena should do some preliminary testing of the words of interest to see to what extent different parameter settings impact the results. In addition, the two techniques used here, the closests of the closests and the similarity slope, can be additional tools in the analysis of large corpora.

[Back to the introduction post](wvecs_intro.html)

Other posts

* [Stabilizing randomness](wvecs_random_fix.html)
* [A new measure for evaluation](wvecs_evaluation_measure.html)
* [Finding useful parameters and methods](wvecs_testsets_parameters.nb.html)
* [Visualizing word vectors](wvecs_visualization.html)

## References

<span id="ref1">[1]</span> Vivek Kulkarni, Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2014. Statistically significant detection of linguistic change. In Proc. 24th WWW Conf., pp. 625–635. International World Wide Web Conferences Steering Committee.

<span id="ref2">[2]</span> William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016. Diachronic word embeddings reveal historical laws of semantic change. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

<span id="ref3">[3]</span> Terrence Szymanski. 2017. Temporal Word Analogies: Identifying Lexical Replacement with Diachronic Word Embeddings. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Short Papers), pp. 448–453.

<span id="ref4">[4]</span> Hyperwords: https://bitbucket.org/omerlevy/hyperwords, published as Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving Distributional Similarity with Lessons Learned from Word Embeddings. Transactions of the Association for Computational Linguistics, vol. 3, pp. 211–225.

<span id="ref5">[5]</span> Gensim: https://radimrehurek.com/gensim/, published as: Software Framework for Topic Modelling with Large Corpora. Radim Řehůřek and Petr Sojka. Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45-50, 22 May 2010.
