Master of Data Science Candidate @ Univ of California San Diego, Data Engineer Intern @ Gotion, Email: huyuan17@gmail.com

Building a "no sweat" Word Embeddings

Introduction

The large number of English words can make language-based applications daunting. To cope with this, it is helpful to have a clustering or embedding of these words, so that words with similar meanings are clustered together, or have embeddings that are close to one another.

But how can we get at the meanings of words?

“You shall know a word by the company it keeps.”

  • John Firth (1957)

That is, words that tend to appear in similar contexts are likely to be related.

In this article, I will investigate this idea by building a 100-dimensional embedding of words based on co-occurrence statistics.

The description here assumes you are using Python with NLTK.

import numpy as np
import pandas as pd

import nltk
nltk.download('brown')
nltk.download('stopwords')  # the stopword list is used in preprocessing below
from nltk.corpus import brown
from nltk.corpus import stopwords

import collections
from string import digits, punctuation
[nltk_data] Downloading package brown to
[nltk_data]     /Users/galaxie500/nltk_data...
[nltk_data]   Package brown is already up-to-date!
  • Prepare data

First, download the Brown corpus (using nltk.corpus). This is a collection of text samples from a wide range of sources, with a total of over a million words. Calling brown.words() returns the text as one long word sequence (a corpus view), which we convert to a plain list.

print(type(brown.words()))
words = list(brown.words())
print(f"Length of raw Brown corpus: {len(words)}")
<class 'nltk.corpus.reader.util.ConcatenatedCorpusView'>
Length of raw Brown corpus: 1161192

(a) Step-by-step: build a 100-dimensional embedding

  • Lowercase and remove stopwords and punctuation

Remove stopwords and punctuation, make everything lowercase, and count how often each word occurs. Use this to come up with two lists:

  1. A vocabulary V, consisting of a few thousand (e.g., 5000) of the most commonly-occurring words.
  2. A shorter list C of at most 1000 of the most commonly-occurring words, which we shall call context words.
# clean text data: strip punctuation, lowercase everything, drop stopwords and empty strings
def preprocessing(corpus):
    for i in range(len(corpus)):
        corpus[i] = corpus[i].translate(str.maketrans('', '', punctuation))
        #corpus[i] = corpus[i].translate(str.maketrans('', '', digits))
        corpus[i] = corpus[i].lower()

    # note: stopwords.words() with no argument returns stopwords for every language NLTK ships
    stop_words = set(stopwords.words())
    corpus = [w for w in corpus if w != '' and w not in stop_words]  # remove empty strings and stopwords
    return corpus
words = preprocessing(words)
  • Generate vocabulary V and shorter list C
def extract_frequent_words(word_list, freq=1000):
    # count word frequencies and keep the `freq` most common words
    count = collections.Counter(word_list)
    word_count = {'Word': list(count.keys()), 'Count': list(count.values())}
    df = pd.DataFrame(data=word_count)
    df = df.sort_values(by='Count', ascending=False)
    selected = df[0:freq]
    selected = selected.sort_index()  # restore first-appearance order of the selected words
    return selected
C_df = extract_frequent_words(words, freq=1000)
C_list = C_df.Word.values
V_df = extract_frequent_words(words, freq=5000)
V_df.columns = ['V_words', 'window_Count']
V_list = V_df.V_words.values

For each word \(w \in V\), and each occurrence of it in the text stream, look at the surrounding window of four words (two before, two after):

\[w_1 \quad w_2 \quad w \quad w_3 \quad w_4\]

Keep count of how often context words from \(C\) appear in these positions around word \(w\). That is, for \(w \in V\), \(c \in C\), define:

\(n(w,c) =\) number of times \(c\) occurs in a window around \(w\)

Using these counts, construct the probability distribution \(Pr(c \vert w)\) of context words around \(w\) (for each \(w \in V\)), as well as the overall distribution \(Pr(c)\) of context words. These are distributions over \(C\).
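
Before the full pandas implementation under the next bullet, here is a compact sketch of the same counting logic using plain Counters. It is a sketch only: it assumes the words, V_list, and C_list built above, and the names n_wc and w_occurrences are illustrative, not part of the implementation that follows.

from collections import Counter, defaultdict

V_set, C_set = set(V_list), set(C_list)
n_wc = defaultdict(Counter)      # n_wc[w][c] = n(w, c)
w_occurrences = Counter()        # number of windows (= occurrences) of each w

for i, w in enumerate(words):
    if w not in V_set:
        continue
    # two words before and two after, clipped at the ends of the stream
    window = words[max(0, i - 2):i] + words[i + 1:i + 3]
    w_occurrences[w] += 1
    for c in set(window):        # count each context word at most once per window
        if c in C_set:
            n_wc[w][c] += 1

# Pr(c|w) is then n_wc[w][c] / w_occurrences[w]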

  • Calculate \(Pr(c \vert w)\)
# track every position in the text stream where each word w in V appears
V_pos = [[x for x, n in enumerate(words) if n == w] for w in V_list]
Windows = []
context_word_count = []
for pos in V_pos:
    window = [] # context words surrounding w for w in V
    for i in pos:
        if i==0:
            cur_window = words[1:3]
        elif i==1:
            cur_window = [words[0]] + words[2:4]
        else:
            cur_window = words[i-2:i] + words[i+1:i+3]
        
        # exclude duplicate context words in a single window
        cur_unique = []
        for j in cur_window:
            if j not in cur_unique:
                cur_unique.append(j)
        
        window += cur_unique
    
    # count occurrences of each context word across all windows of w
    context_word_count += list(collections.Counter(window).values())

    # keep one entry per distinct context word, preserving first-appearance order
    context_word_unique = []
    for k in window:
        if k not in context_word_unique:
            context_word_unique.append(k)
            
    Windows.append(context_word_unique)
V_df['context_words'] = Windows
V_df = V_df.explode('context_words')
V_df['context_word_count'] = context_word_count
V_df['Pr(c|w)'] = V_df['context_word_count'].astype('float')/V_df['window_Count']
V_df['in_C'] = [1 if i in C_list else 0 for i in V_df.context_words.values]
# drop all the rows with context words that are not in our shorter list C
V_df = V_df[V_df.in_C==1]

# preview V_df without the boolean helper column 'in_C'
# (drop() returns a new DataFrame for display here; V_df itself still keeps the column)
V_df.drop(columns='in_C')

Output:

V_words window_Count context_words context_word_count Pr(c|w)
1 county 155 general 1 0.006452
1 county 155 none 1 0.006452
1 county 155 future 1 0.006452
1 county 155 doctor 1 0.006452
1 county 155 cent 1 0.006452
... ... ... ... ... ...
47101 letch 19 person 1 0.052632
47101 letch 19 took 1 0.052632
47101 letch 19 energy 1 0.052632
47101 letch 19 said 1 0.052632
47101 letch 19 interested 1 0.052632

461410 rows × 5 columns

  • Calculate Pr(c)
# calculate distribution of context words in C
C_list_uniq = list(V_df.context_words.unique())
Pr_C = {}
for c in C_list_uniq:
    Pr_C[c] = words.count(c) / len(words)
V_df['Pr(c)'] = V_df.loc[:, 'context_words'].apply(lambda x: Pr_C[x])
V_df

Output:

V_words window_Count context_words context_word_count Pr(c|w) in_C Pr(c)
1 county 155 general 1 0.006452 1 0.000950
1 county 155 none 1 0.006452 1 0.000206
1 county 155 future 1 0.006452 1 0.000433
1 county 155 doctor 1 0.006452 1 0.000191
1 county 155 cent 1 0.006452 1 0.000296
... ... ... ... ... ... ... ...
47101 letch 19 person 1 0.052632 1 0.000332
47101 letch 19 took 1 0.052632 1 0.000813
47101 letch 19 energy 1 0.052632 1 0.000191
47101 letch 19 said 1 0.052632 1 0.003741
47101 letch 19 interested 1 0.052632 1 0.000200

461410 rows × 7 columns

  • Calculate pointwise mutual information
Represent each vocabulary item \(w\) by a \(|C|\)-dimensional vector \(\phi(w)\), whose \(c\)'th coordinate is:
\[\phi_c(w) = \max\left(0,\ \log\frac{Pr(c \vert w)}{Pr(c)}\right)\]

This is known as the (positive) pointwise mutual information, and has been quite successful in work on word embeddings.
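
As a quick sanity check (using the natural log, as math.log does below), plug in the county/general row from the earlier table, where \(Pr(c \vert w) = 0.006452\) and \(Pr(c) = 0.000950\):

\[\phi_{general}(county) = \max\left(0,\ \log\frac{0.006452}{0.000950}\right) \approx 1.916\]

which matches the Phi value of 1.915628 in the output below.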

import math

def pointwise_mutual_info(row):
    # positive PMI: max(0, log(Pr(c|w) / Pr(c)))
    a = row['Pr(c|w)']
    b = row['Pr(c)']
    res = math.log(a/b)
    return max(0, res)
V_df['Phi'] = V_df.apply(pointwise_mutual_info, axis=1)
V_df

Output:

V_words window_Count context_words context_word_count Pr(c|w) in_C Pr(c) Phi
1 county 155 general 1 0.006452 1 0.000950 1.915628
1 county 155 none 1 0.006452 1 0.000206 3.444097
1 county 155 future 1 0.006452 1 0.000433 2.701278
1 county 155 doctor 1 0.006452 1 0.000191 3.521058
1 county 155 cent 1 0.006452 1 0.000296 3.082803
... ... ... ... ... ... ... ... ...
47101 letch 19 person 1 0.052632 1 0.000332 5.066159
47101 letch 19 took 1 0.052632 1 0.000813 4.170775
47101 letch 19 energy 1 0.052632 1 0.000191 5.620044
47101 letch 19 said 1 0.052632 1 0.003741 2.644005
47101 letch 19 interested 1 0.052632 1 0.000200 5.571254

461410 rows × 8 columns

  • Dimension Reduction

Before doing dimension reduction, we first build a sparse matrix whose rows are the vocabulary words in \(V\), whose columns are the context words in \(C\), and whose entries are the pointwise mutual information values.

sparse_table = pd.pivot_table(V_df, index = 'V_words', columns = 'context_words', values = 'Phi')
sparse_table

Output:

context_words 1 10 100 12 15 1959 1960 1961 2 20 ... written wrong wrote year years yes yet york young youre
V_words
0 5.973217 NaN NaN NaN NaN NaN NaN NaN 4.629721 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 3.965613 3.781690 3.340674 1.984232 3.333348 NaN 3.328978 3.355489 4.750349 2.213074 ... 1.830082 NaN NaN 2.323740 0.651427 NaN NaN 1.156607 0.913791 NaN
10 3.648159 NaN 4.766189 4.731503 4.471181 NaN 3.550520 NaN NaN 4.449519 ... 2.967915 NaN 2.80637 2.614275 3.804163 NaN NaN 2.294440 2.051624 NaN
100 3.340674 4.766189 NaN NaN NaN 3.919708 3.397186 NaN NaN NaN ... NaN NaN NaN NaN 2.734538 NaN NaN NaN NaN NaN
1000 4.697981 NaN 4.989332 NaN 4.694324 NaN NaN NaN NaN NaN ... NaN NaN NaN 4.223713 NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
youth NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN 3.502838 NaN
youve NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 3.928008 4.105148 NaN NaN 2.056206 NaN 2.92709 NaN NaN 3.947681
zen NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
zero NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
zg NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5000 rows × 1000 columns

Here, we apply truncated Singular Value Decomposition (SVD) to reduce the sparse matrix above to a 100-dimensional representation for each word.

# fill NaN with 0s
sparse_table = sparse_table.fillna(0)
from sklearn.decomposition import TruncatedSVD

X = np.asarray(sparse_table)

# fit the truncated SVD and project the PMI matrix onto 100 components in one step
svd = TruncatedSVD(n_components=100, n_iter=5, random_state=42)
X_new = svd.fit_transform(X)
df_new = pd.DataFrame(X_new, index = sparse_table.index)
df_new

Output:

0 1 2 3 4 5 6 7 8 9 ... 90 91 92 93 94 95 96 97 98 99
V_words
0 5.542319 -1.995680 1.381643 -2.909863 -3.406599 -1.401489 -3.144874 0.091159 0.404984 -2.519494 ... -0.272699 0.328477 0.218683 -0.656883 -0.016048 -1.542390 1.702155 -0.510453 -1.331743 0.531266
1 23.523418 -12.896316 11.089762 -3.408536 -9.895360 -3.761695 0.060779 7.032133 -1.337786 -3.057331 ... -1.908982 1.097818 -1.488350 -1.397893 1.135940 -0.119767 0.577449 -0.072922 1.941455 1.521440
10 17.485695 -4.392681 13.806938 -0.205896 -9.351523 -3.323489 2.553425 -0.650727 4.434021 -3.258592 ... -1.889507 -1.311489 0.508911 -1.792407 0.841425 1.114498 0.939590 1.554220 -1.658337 1.261531
100 12.115965 -3.480162 7.104189 -1.246714 -4.169442 -0.304919 1.910621 -2.503645 2.942702 -1.915575 ... 1.479620 0.117896 -0.171019 1.062107 1.123733 -0.680444 -1.128887 -1.518986 1.168567 0.436447
1000 6.748229 -3.263267 3.999501 -1.575692 -4.276787 -0.313027 1.567308 -1.057089 2.655190 -2.324639 ... 0.150966 1.584698 1.008214 1.805991 -1.644718 0.632643 0.848221 -0.695705 0.090873 -0.922286
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
youth 13.327412 0.517403 -2.878062 -0.924167 2.294374 -0.957506 -0.091582 -3.612247 -1.984099 -1.501401 ... 0.115512 0.774741 1.203958 3.243366 -1.524132 -0.470964 -1.921117 1.201018 2.303047 -0.377757
youve 13.042776 8.514331 -6.500516 2.708451 -5.992631 1.492063 0.820251 -0.925629 0.217164 -1.467112 ... -0.712831 -1.747105 -1.294000 0.663011 -0.229572 0.423359 -1.324096 -0.035515 -0.995549 -0.576969
zen 4.458987 -1.740182 -1.933253 -1.843287 0.210213 -1.749685 -0.096430 -1.115299 -0.678715 1.249615 ... -1.146979 -0.144040 -0.256134 -0.293627 0.389218 0.439939 -0.344095 -0.105342 0.127606 -1.844329
zero 6.126870 0.258309 0.348387 -3.972732 -0.751711 -1.251426 -2.420377 1.122941 0.488333 -0.156903 ... -0.210953 0.934246 -1.952539 1.080573 0.198827 -0.895675 -1.193648 -0.156535 -0.424499 1.452921
zg 3.466457 -0.743264 -0.012059 -1.887242 -0.635641 -1.200829 -3.495046 1.720301 0.282265 0.022144 ... 0.220086 -0.863757 0.438066 1.188732 0.514399 0.546616 -0.517487 0.494538 0.355671 -0.530369

5000 rows × 100 columns
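
As an optional sanity check (not part of the original pipeline), TruncatedSVD exposes how much of the PMI matrix's variance the 100 components retain:

# optional: fraction of variance captured by the 100 SVD components
print(f"Total explained variance ratio: {svd.explained_variance_ratio_.sum():.3f}")
print(f"Embedding matrix shape: {X_new.shape}")  # expected: (5000, 100)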

(b) Nearest neighbor results

Pick a collection of words \(w \in V\) (14 of them here). For each \(w\), return its nearest neighbor \(w' \neq w\) in \(V\). A popular distance measure for this is cosine distance:

\[1-\frac{\phi(w)\cdot\phi(w')}{\left\Vert\phi(w)\right\Vert\,\left\Vert\phi(w')\right\Vert}\]
from sklearn.metrics.pairwise import cosine_similarity

# pairwise cosine distances between all 5000 embedded words
S = 1 - cosine_similarity(X_new, X_new)
S_df = pd.DataFrame(S, index=sparse_table.index, columns=sparse_table.index)

# overwrite the zero diagonal so idxmin() below never returns the word itself
np.fill_diagonal(S_df.values, 1)
S_df

Output:

V_words 0 1 10 100 1000 10000 11 12 13 14 ... youll young younger youngsters youre youth youve zen zero zg
V_words
0 1.000000 0.496496 0.612922 0.704027 0.699172 0.709779 0.758037 0.676296 0.783534 0.703629 ... 0.742794 0.753591 0.892090 0.733692 0.792452 0.849207 0.785617 0.843505 0.567424 0.702720
1 0.496496 1.000000 0.270032 0.407476 0.426005 0.598368 0.393145 0.335564 0.439774 0.364654 ... 0.725060 0.535398 0.794009 0.707147 0.796263 0.730820 0.776692 0.674079 0.570495 0.772490
10 0.612922 0.270032 1.000000 0.336588 0.410105 0.531437 0.251751 0.242511 0.308180 0.196244 ... 0.689861 0.578702 0.622325 0.658268 0.763801 0.776299 0.775649 0.812817 0.780041 0.923918
100 0.704027 0.407476 0.336588 1.000000 0.471953 0.529019 0.433712 0.423225 0.473367 0.443475 ... 0.665715 0.581621 0.759004 0.760093 0.755221 0.645551 0.737462 0.677804 0.749595 0.889030
1000 0.699172 0.426005 0.410105 0.471953 1.000000 0.694203 0.532895 0.573881 0.464923 0.510910 ... 0.695655 0.760756 0.832903 0.862806 0.841124 0.666632 0.854410 0.928525 0.830016 0.923236
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
youth 0.849207 0.730820 0.776299 0.645551 0.666632 0.811296 0.802807 0.837998 0.775515 0.782112 ... 0.603810 0.388828 0.612672 0.738526 0.614601 1.000000 0.632270 0.636307 0.708969 0.846091
youve 0.785617 0.776692 0.775649 0.737462 0.854410 0.749709 0.773328 0.804094 0.916708 0.787269 ... 0.419042 0.478772 0.599636 0.814881 0.294050 0.632270 1.000000 0.650591 0.714497 0.884929
zen 0.843505 0.674079 0.812817 0.677804 0.928525 0.815137 0.739136 0.813954 0.856319 0.872752 ... 0.774752 0.655871 0.794671 0.668088 0.783287 0.636307 0.650591 1.000000 0.780444 0.809703
zero 0.567424 0.570495 0.780041 0.749595 0.830016 0.770043 0.779864 0.574442 0.784010 0.758421 ... 0.762509 0.685187 0.872117 0.858662 0.745964 0.708969 0.714497 0.780444 1.000000 0.531597
zg 0.702720 0.772490 0.923918 0.889030 0.923236 0.829799 0.939233 0.771256 0.887662 0.925314 ... 0.812688 0.861458 0.917434 0.817756 0.890121 0.846091 0.884929 0.809703 0.531597 1.000000

5000 rows × 5000 columns

to_check = ['communism', 'autumn', 'cigarette', 'pulmonary', 'mankind', 
            'africa', 'chicago', 'revolution', 'september', 'chemical', 
            'detergent', 'dictionary', 'storm', 'worship']
S_dict = {}
for t in to_check:
    S_dict[t] = S_df[t].idxmin()
    
for d in S_dict:
    print (f'{d} ------> {S_dict[d]}')

Output:

communism ------> almost
autumn ------> dawn
cigarette ------> fingers
pulmonary ------> artery
mankind ------> nation
africa ------> western
chicago ------> club
revolution ------> movement
september ------> december
chemical ------> drugs
detergent ------> tubes
dictionary ------> text
storm ------> wedding
worship ------> organized

As we can see from the output above, several of the nearest neighbors do make sense, such as pulmonary -> artery, september -> december, autumn -> dawn, mankind -> nation, and dictionary -> text, while others (storm -> wedding, detergent -> tubes) are less convincing.
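
To look up other words interactively, a small helper along these lines (illustrative only, reusing the S_df distance matrix built above) wraps the same lookup:

def nearest_neighbors(word, k=1):
    # return the k vocabulary words closest to `word` by cosine distance
    if word not in S_df.columns:
        raise KeyError(f"'{word}' is not in the vocabulary V")
    return S_df[word].nsmallest(k).index.tolist()

print(nearest_neighbors('september', k=3))  # 'december' was the closest match in the run above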
