Master of Data Science Candidate @ Univ of California San Diego, Data Engineer Intern @ Gotion, Email: huyuan17@gmail.com

Building a "no sweat" Word Embeddings

Introduction

The large number of English words can make language-based applications daunting. To cope with this, it is helpful to have a clustering or embedding of these words, so that words with similar meanings are clustered together, or have embeddings that are close to one another.

But how can we get at the meanings of words?

“You shall know a word by the company it keeps.”

  • John Firth (1957)

That is, words that tend to appear in similar contexts are likely to be related.

In this article, I will investigate this idea by building a 100-dimensional embedding of words based on co-occurrence statistics.

The description here assumes you are using Python with NLTK.

import numpy as np
import pandas as pd

import nltk
nltk.download('brown')
nltk.download('stopwords')  # the stopword list is used in preprocessing below
from nltk.corpus import brown
from nltk.corpus import stopwords

import collections
from string import digits, punctuation
[nltk_data] Downloading package brown to
[nltk_data]     /Users/galaxie500/nltk_data...
[nltk_data]   Package brown is already up-to-date!
  • Prepare data

First, download the Brown corpus (using nltk.corpus). This is a collection of text samples from a wide range of sources, with a total of over a million words. Calling brown.words() returns the text as one long word sequence (a corpus view), which we convert to a plain list.

print(type(brown.words()))
words = list(brown.words())
print(f"Length of raw Brown corpus: {len(words)}")
<class 'nltk.corpus.reader.util.ConcatenatedCorpusView'>
Length of raw Brown corpus: 1161192

(a) Step-by-step: build a 100-dimensional embedding

  • Lowercase and remove stopwords and punctuation

Remove stopwords and punctuation, make everything lowercase, and count how often each word occurs. Use this to come up with two lists:

  1. A vocabulary V, consisting of a few thousand (e.g., 5000) of the most commonly-occurring words.
  2. A shorter list C of at most 1000 of the most commonly-occurring words, which we shall call context words.
# clean text data: strip punctuation, lowercase everything, drop stopwords and empty strings
def preprocessing(corpus):
    for i in range(len(corpus)):
        corpus[i] = corpus[i].translate(str.maketrans('', '', punctuation))
        #corpus[i] = corpus[i].translate(str.maketrans('', '', digits))
        corpus[i] = corpus[i].lower()

    # note: stopwords.words() with no argument returns stopwords for every language NLTK ships
    stop_words = set(stopwords.words())
    corpus = [w for w in corpus if w != '' and w not in stop_words]  # remove empty strings and stopwords
    return corpus
words = preprocessing(words)
  • Generate vocabulary V and shorter list C
def extract_frequent_words(word_list, freq=1000):
    # count word frequencies and keep the `freq` most common words
    count = collections.Counter(word_list)
    word_count = {'Word': list(count.keys()), 'Count': list(count.values())}
    df = pd.DataFrame(data=word_count)
    df = df.sort_values(by='Count', ascending=False)
    selected = df[0:freq]
    selected = selected.sort_index()  # restore first-appearance order of the selected words
    return selected
C_df = extract_frequent_words(words, freq=1000)
C_list = C_df.Word.values
V_df = extract_frequent_words(words, freq=5000)
V_df.columns = ['V_words', 'window_Count']
V_list = V_df.V_words.values

For each word \(w \in V\), and each occurrence of it in the text stream, look at the surrounding window of four words (two before, two after):

\[w_1 \quad w_2 \quad w \quad w_3 \quad w_4\]

Keep count of how often context words from \(C\) appear in these positions around word \(w\). That is, for \(w \in V\), \(c \in C\), define:

\(n(w,c) =\) number of times \(c\) occurs in a window around \(w\)

Using these counts, construct the probability distribution \(Pr(c \vert w)\) of context words around \(w\) (for each \(w \in V\)), as well as the overall distribution \(Pr(c)\) of context words. These are distributions over \(C\).
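
Before the full pandas implementation under the next bullet, here is a compact sketch of the same counting logic using plain Counters. It is a sketch only: it assumes the words, V_list, and C_list built above, and the names n_wc and w_occurrences are illustrative, not part of the implementation that follows.

from collections import Counter, defaultdict

V_set, C_set = set(V_list), set(C_list)
n_wc = defaultdict(Counter)      # n_wc[w][c] = n(w, c)
w_occurrences = Counter()        # number of windows (= occurrences) of each w

for i, w in enumerate(words):
    if w not in V_set:
        continue
    # two words before and two after, clipped at the ends of the stream
    window = words[max(0, i - 2):i] + words[i + 1:i + 3]
    w_occurrences[w] += 1
    for c in set(window):        # count each context word at most once per window
        if c in C_set:
            n_wc[w][c] += 1

# Pr(c|w) is then n_wc[w][c] / w_occurrences[w]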

  • Calculate \(Pr(c \vert w)\)
# track every position in the text stream where each word w in V appears
V_pos = [[x for x, n in enumerate(words) if n == w] for w in V_list]
Windows = []
context_word_count = []
for pos in V_pos:
    window = [] # context words surrounding w for w in V
    for i in pos:
        if i==0:
            cur_window = words[1:3]
        elif i==1:
            cur_window = [words[0]] + words[2:4]
        else:
            cur_window = words[i-2:i] + words[i+1:i+3]
        
        # exclude duplicate context words in a single window
        cur_unique = []
        for j in cur_window:
            if j not in cur_unique:
                cur_unique.append(j)
        
        window += cur_unique
    
    # count occurrences of each context word across all windows of w
    context_word_count += list(collections.Counter(window).values())

    # keep one entry per distinct context word, preserving first-appearance order
    context_word_unique = []
    for k in window:
        if k not in context_word_unique:
            context_word_unique.append(k)
            
    Windows.append(context_word_unique)
V_df['context_words'] = Windows
V_df = V_df.explode('context_words')
V_df['context_word_count'] = context_word_count
V_df['Pr(c|w)'] = V_df['context_word_count'].astype('float')/V_df['window_Count']
V_df['in_C'] = [1 if i in C_list else 0 for i in V_df.context_words.values]
# drop all the rows with context words that are not in our shorter list C
V_df = V_df[V_df.in_C==1]

# preview V_df without the boolean helper column 'in_C'
# (drop() returns a new DataFrame for display here; V_df itself still keeps the column)
V_df.drop(columns='in_C')

Output:

V_words window_Count context_words context_word_count Pr(c|w)
1 county 155 general 1 0.006452
1 county 155 none 1 0.006452
1 county 155 future 1 0.006452
1 county 155 doctor 1 0.006452
1 county 155 cent 1 0.006452
... ... ... ... ... ...
47101 letch 19 person 1 0.052632
47101 letch 19 took 1 0.052632
47101 letch 19 energy 1 0.052632
47101 letch 19 said 1 0.052632
47101 letch 19 interested 1 0.052632

461410 rows × 5 columns

  • Calculate Pr(c)
# calculate distribution of context words in C
C_list_uniq = list(V_df.context_words.unique())
Pr_C = {}
for c in C_list_uniq:
    Pr_C[c] = words.count(c) / len(words)
V_df['Pr(c)'] = V_df.loc[:, 'context_words'].apply(lambda x: Pr_C[x])
V_df

Output:

V_words window_Count context_words context_word_count Pr(c|w) in_C Pr(c)
1 county 155 general 1 0.006452 1 0.000950
1 county 155 none 1 0.006452 1 0.000206
1 county 155 future 1 0.006452 1 0.000433
1 county 155 doctor 1 0.006452 1 0.000191
1 county 155 cent 1 0.006452 1 0.000296
... ... ... ... ... ... ... ...
47101 letch 19 person 1 0.052632 1 0.000332
47101 letch 19 took 1 0.052632 1 0.000813
47101 letch 19 energy 1 0.052632 1 0.000191
47101 letch 19 said 1 0.052632 1 0.003741
47101 letch 19 interested 1 0.052632 1 0.000200

461410 rows × 7 columns

  • Calculate pointwise mutual information
Represent each vocabulary item \(w\) by a \(|C|\)-dimensional vector \(\phi(w)\), whose \(c\)'th coordinate is:
\[\phi_c(w) = \max\left(0,\ \log\frac{Pr(c \vert w)}{Pr(c)}\right)\]

This is known as the (positive) pointwise mutual information, and has been quite successful in work on word embeddings.
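
As a quick sanity check (using the natural log, as math.log does below), plug in the county/general row from the earlier table, where \(Pr(c \vert w) = 0.006452\) and \(Pr(c) = 0.000950\):

\[\phi_{general}(county) = \max\left(0,\ \log\frac{0.006452}{0.000950}\right) \approx 1.916\]

which matches the Phi value of 1.915628 in the output below.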

import math

def pointwise_mutual_info(row):
    # positive PMI: max(0, log(Pr(c|w) / Pr(c)))
    a = row['Pr(c|w)']
    b = row['Pr(c)']
    res = math.log(a/b)
    return max(0, res)
V_df['Phi'] = V_df.apply(pointwise_mutual_info, axis=1)
V_df

Output:

V_words window_Count context_words context_word_count Pr(c|w) in_C Pr(c) Phi
1 county 155 general 1 0.006452 1 0.000950 1.915628
1 county 155 none 1 0.006452 1 0.000206 3.444097
1 county 155 future 1 0.006452 1 0.000433 2.701278
1 county 155 doctor 1 0.006452 1 0.000191 3.521058
1 county 155 cent 1 0.006452 1 0.000296 3.082803
... ... ... ... ... ... ... ... ...
47101 letch 19 person 1 0.052632 1 0.000332 5.066159
47101 letch 19 took 1 0.052632 1 0.000813 4.170775
47101 letch 19 energy 1 0.052632 1 0.000191 5.620044
47101 letch 19 said 1 0.052632 1 0.003741 2.644005
47101 letch 19 interested 1 0.052632 1 0.000200 5.571254

461410 rows × 8 columns

  • Dimension Reduction

Before doing dimension reduction, we first build a sparse matrix whose rows are the vocabulary words in \(V\), whose columns are the context words in \(C\), and whose entries are the pointwise mutual information values.

sparse_table = pd.pivot_table(V_df, index = 'V_words', columns = 'context_words', values = 'Phi')
sparse_table

Output:

context_words 1 10 100 12 15 1959 1960 1961 2 20 ... written wrong wrote year years yes yet york young youre
V_words
0 5.973217 NaN NaN NaN NaN NaN NaN NaN 4.629721 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 3.965613 3.781690 3.340674 1.984232 3.333348 NaN 3.328978 3.355489 4.750349 2.213074 ... 1.830082 NaN NaN 2.323740 0.651427 NaN NaN 1.156607 0.913791 NaN
10 3.648159 NaN 4.766189 4.731503 4.471181 NaN 3.550520 NaN NaN 4.449519 ... 2.967915 NaN 2.80637 2.614275 3.804163 NaN NaN 2.294440 2.051624 NaN
100 3.340674 4.766189 NaN NaN NaN 3.919708 3.397186 NaN NaN NaN ... NaN NaN NaN NaN 2.734538 NaN NaN NaN NaN NaN
1000 4.697981 NaN 4.989332 NaN 4.694324 NaN NaN NaN NaN NaN ... NaN NaN NaN 4.223713 NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
youth NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN 3.502838 NaN
youve NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 3.928008 4.105148 NaN NaN 2.056206 NaN 2.92709 NaN NaN 3.947681
zen NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
zero NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
zg NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5000 rows × 1000 columns

Here, we apply truncated Singular Value Decomposition (SVD) to reduce the sparse matrix above to a 100-dimensional representation for each word.

# fill NaN with 0s
sparse_table = sparse_table.fillna(0)
from sklearn.decomposition import TruncatedSVD

X = np.asarray(sparse_table)

# fit the truncated SVD and project the PMI matrix onto 100 components in one step
svd = TruncatedSVD(n_components=100, n_iter=5, random_state=42)
X_new = svd.fit_transform(X)
df_new = pd.DataFrame(X_new, index = sparse_table.index)
df_new

Output:

0 1 2 3 4 5 6 7 8 9 ... 90 91 92 93 94 95 96 97 98 99
V_words
0 5.542319 -1.995680 1.381643 -2.909863 -3.406599 -1.401489 -3.144874 0.091159 0.404984 -2.519494 ... -0.272699 0.328477 0.218683 -0.656883 -0.016048 -1.542390 1.702155 -0.510453 -1.331743 0.531266
1 23.523418 -12.896316 11.089762 -3.408536 -9.895360 -3.761695 0.060779 7.032133 -1.337786 -3.057331 ... -1.908982 1.097818 -1.488350 -1.397893 1.135940 -0.119767 0.577449 -0.072922 1.941455 1.521440
10 17.485695 -4.392681 13.806938 -0.205896 -9.351523 -3.323489 2.553425 -0.650727 4.434021 -3.258592 ... -1.889507 -1.311489 0.508911 -1.792407 0.841425 1.114498 0.939590 1.554220 -1.658337 1.261531
100 12.115965 -3.480162 7.104189 -1.246714 -4.169442 -0.304919 1.910621 -2.503645 2.942702 -1.915575 ... 1.479620 0.117896 -0.171019 1.062107 1.123733 -0.680444 -1.128887 -1.518986 1.168567 0.436447
1000 6.748229 -3.263267 3.999501 -1.575692 -4.276787 -0.313027 1.567308 -1.057089 2.655190 -2.324639 ... 0.150966 1.584698 1.008214 1.805991 -1.644718 0.632643 0.848221 -0.695705 0.090873 -0.922286
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
youth 13.327412 0.517403 -2.878062 -0.924167 2.294374 -0.957506 -0.091582 -3.612247 -1.984099 -1.501401 ... 0.115512 0.774741 1.203958 3.243366 -1.524132 -0.470964 -1.921117 1.201018 2.303047 -0.377757
youve 13.042776 8.514331 -6.500516 2.708451 -5.992631 1.492063 0.820251 -0.925629 0.217164 -1.467112 ... -0.712831 -1.747105 -1.294000 0.663011 -0.229572 0.423359 -1.324096 -0.035515 -0.995549 -0.576969
zen 4.458987 -1.740182 -1.933253 -1.843287 0.210213 -1.749685 -0.096430 -1.115299 -0.678715 1.249615 ... -1.146979 -0.144040 -0.256134 -0.293627 0.389218 0.439939 -0.344095 -0.105342 0.127606 -1.844329
zero 6.126870 0.258309 0.348387 -3.972732 -0.751711 -1.251426 -2.420377 1.122941 0.488333 -0.156903 ... -0.210953 0.934246 -1.952539 1.080573 0.198827 -0.895675 -1.193648 -0.156535 -0.424499 1.452921
zg 3.466457 -0.743264 -0.012059 -1.887242 -0.635641 -1.200829 -3.495046 1.720301 0.282265 0.022144 ... 0.220086 -0.863757 0.438066 1.188732 0.514399 0.546616 -0.517487 0.494538 0.355671 -0.530369

5000 rows × 100 columns
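
As an optional sanity check (not part of the original pipeline), TruncatedSVD exposes how much of the PMI matrix's variance the 100 components retain:

# optional: fraction of variance captured by the 100 SVD components
print(f"Total explained variance ratio: {svd.explained_variance_ratio_.sum():.3f}")
print(f"Embedding matrix shape: {X_new.shape}")  # expected: (5000, 100)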

(b) Nearest neighbor results

Pick a collection of words \(w \in V\) (14 of them here). For each \(w\), return its nearest neighbor \(w' \neq w\) in \(V\). A popular distance measure for this is cosine distance:

\[1-\frac{\phi(w)\cdot\phi(w')}{\left\Vert\phi(w)\right\Vert\,\left\Vert\phi(w')\right\Vert}\]
from sklearn.metrics.pairwise import cosine_similarity

# pairwise cosine distances between all 5000 embedded words
S = 1 - cosine_similarity(X_new, X_new)
S_df = pd.DataFrame(S, index=sparse_table.index, columns=sparse_table.index)

# overwrite the zero diagonal so idxmin() below never returns the word itself
np.fill_diagonal(S_df.values, 1)
S_df

Output:

V_words 0 1 10 100 1000 10000 11 12 13 14 ... youll young younger youngsters youre youth youve zen zero zg
V_words
0 1.000000 0.496496 0.612922 0.704027 0.699172 0.709779 0.758037 0.676296 0.783534 0.703629 ... 0.742794 0.753591 0.892090 0.733692 0.792452 0.849207 0.785617 0.843505 0.567424 0.702720
1 0.496496 1.000000 0.270032 0.407476 0.426005 0.598368 0.393145 0.335564 0.439774 0.364654 ... 0.725060 0.535398 0.794009 0.707147 0.796263 0.730820 0.776692 0.674079 0.570495 0.772490
10 0.612922 0.270032 1.000000 0.336588 0.410105 0.531437 0.251751 0.242511 0.308180 0.196244 ... 0.689861 0.578702 0.622325 0.658268 0.763801 0.776299 0.775649 0.812817 0.780041 0.923918
100 0.704027 0.407476 0.336588 1.000000 0.471953 0.529019 0.433712 0.423225 0.473367 0.443475 ... 0.665715 0.581621 0.759004 0.760093 0.755221 0.645551 0.737462 0.677804 0.749595 0.889030
1000 0.699172 0.426005 0.410105 0.471953 1.000000 0.694203 0.532895 0.573881 0.464923 0.510910 ... 0.695655 0.760756 0.832903 0.862806 0.841124 0.666632 0.854410 0.928525 0.830016 0.923236
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
youth 0.849207 0.730820 0.776299 0.645551 0.666632 0.811296 0.802807 0.837998 0.775515 0.782112 ... 0.603810 0.388828 0.612672 0.738526 0.614601 1.000000 0.632270 0.636307 0.708969 0.846091
youve 0.785617 0.776692 0.775649 0.737462 0.854410 0.749709 0.773328 0.804094 0.916708 0.787269 ... 0.419042 0.478772 0.599636 0.814881 0.294050 0.632270 1.000000 0.650591 0.714497 0.884929
zen 0.843505 0.674079 0.812817 0.677804 0.928525 0.815137 0.739136 0.813954 0.856319 0.872752 ... 0.774752 0.655871 0.794671 0.668088 0.783287 0.636307 0.650591 1.000000 0.780444 0.809703
zero 0.567424 0.570495 0.780041 0.749595 0.830016 0.770043 0.779864 0.574442 0.784010 0.758421 ... 0.762509 0.685187 0.872117 0.858662 0.745964 0.708969 0.714497 0.780444 1.000000 0.531597
zg 0.702720 0.772490 0.923918 0.889030 0.923236 0.829799 0.939233 0.771256 0.887662 0.925314 ... 0.812688 0.861458 0.917434 0.817756 0.890121 0.846091 0.884929 0.809703 0.531597 1.000000

5000 rows × 5000 columns

to_check = ['communism', 'autumn', 'cigarette', 'pulmonary', 'mankind', 
            'africa', 'chicago', 'revolution', 'september', 'chemical', 
            'detergent', 'dictionary', 'storm', 'worship']
S_dict = {}
for t in to_check:
    S_dict[t] = S_df[t].idxmin()
    
for d in S_dict:
    print (f'{d} ------> {S_dict[d]}')

Output:

communism ------> almost
autumn ------> dawn
cigarette ------> fingers
pulmonary ------> artery
mankind ------> nation
africa ------> western
chicago ------> club
revolution ------> movement
september ------> december
chemical ------> drugs
detergent ------> tubes
dictionary ------> text
storm ------> wedding
worship ------> organized

As we can see from the output above, several of the nearest neighbors do make sense, such as pulmonary -> artery, september -> december, autumn -> dawn, mankind -> nation, and dictionary -> text, while others (storm -> wedding, detergent -> tubes) are less convincing.
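
To look up other words interactively, a small helper along these lines (illustrative only, reusing the S_df distance matrix built above) wraps the same lookup:

def nearest_neighbors(word, k=1):
    # return the k vocabulary words closest to `word` by cosine distance
    if word not in S_df.columns:
        raise KeyError(f"'{word}' is not in the vocabulary V")
    return S_df[word].nsmallest(k).index.tolist()

print(nearest_neighbors('september', k=3))  # 'december' was the closest match in the run above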
