Noise Contrastive Estimation and Negative Sampling

Noise Contrastive Estimation (NCE) is a parameter estimation technique used to model a data distribution by learning a classifier that distinguishes samples from the true distribution from samples drawn from a noise distribution. It avoids computing the partition function and its derivatives at each training step, which is a very computationally demanding operation in many other techniques used to model data distributions.

1. Definitions

Before delving further into the details of NCE, it is important to define the following terms and understand the notation used in this paper:

Empirical Distribution: The empirical distribution, referred to in this paper as \(\hat{p}\), is the distribution of the observed data drawn from a p.d.f. Unlike the underlying p.d.f., which is not directly observed, the empirical distribution assigns probability \(\frac{1}{m}\) to each observation, where m is the number of i.i.d. observed samples. Therefore,

\[\displaystyle E_{\hat{p} }(X) = \frac{1}{m} \sum_{i=1}^m X_i\]

Noise Distribution: A noise distribution is a distribution that is different from the empirical distribution being learned. In this paper the noise distribution is referred to as \(q\).

Model Distribution: A model finds parameters \(\theta\) that approximate the empirical distribution as closely as possible. The distribution of data generated by such a model is called the model distribution, referred to in this paper as \(p_{\theta}\).
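To make these definitions concrete, here is a small, purely illustrative NumPy sketch (the data values are made up) that builds an empirical distribution from m observed samples, computes the expectation under it, and sets up a uniform noise distribution q over the same support:

import numpy as np

# Hypothetical observed data: m i.i.d. samples from some unknown p.d.f.
observed = np.array([2.0, 3.5, 2.0, 4.1, 3.5, 2.0])
m = len(observed)

# Empirical distribution p_hat: every observation carries probability 1/m,
# so repeated values accumulate mass.
support, counts = np.unique(observed, return_counts=True)
p_hat = counts / m

# Expectation under the empirical distribution: E_{p_hat}(X) = (1/m) * sum(X_i)
expectation = observed.mean()

# A noise distribution q over the same support; uniform here for simplicity.
q = np.full(len(support), 1.0 / len(support))

print("support:", support)
print("p_hat:  ", p_hat)
print("q:      ", q)
print("E_{p_hat}(X) =", expectation)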

2. Noise Contrastive Estimation

Noise contrastive estimation is used in the Word2Vec and GraphSage algorithms, which learn low-dimensional vector representations of words and nodes respectively, using unlabeled data, for downstream tasks such as node classification or link prediction. The vectors of the words or nodes learned this way are more similar between related entities than between unrelated ones. Here is a brief overview of how these algorithms convert this unsupervised learning problem into a supervised one and what role NCE plays in it:

  1. The problem is initially formulated as a multi-class classification problem which predicts whether an entity is suitable to appear in the context of another in an empirical distribution. In the Word2Vec algorithm it determines whether a word appears in the context of another word, and in GraphSage it determines whether a node is in the neighborhood of another node in the training set.

  2. A new dataset is created from the original one, containing multiple pairs of a context entity c and another entity e that is predicted to appear in the context of c. A training sample \(( c_i, e_{ij} )\) generated in this manner is labeled 1 if \(e_{ij}\) appears in the context of \(c_i\) and 0 otherwise. This now becomes a supervised multi-class classification problem, wherein the output layer of the model predicts the probability of each entity in the vocabulary being in the context of c. The output layer therefore uses a softmax function:
    \(\displaystyle p_{\theta}( e|c ) = \frac{s_{\theta}(e,c)}{\sum_{e' \in V} s_{\theta}(e',c)}\)
    where V is the set of all possible entities, called the ‘vocabulary’, and \(s_{\theta}(e',c)\) is a non-negative score for entity e’ (typically the exponentiated output of the final neural network layer).

  3. The problem with the above formulation is that if |V| is very large, each training iteration requires a large number of weights to be updated, making training computationally too expensive. This is where NCE comes into the picture.

  4. NCE transforms this multi-class classification problem into a small number of binary classification problems and hence drastically reduces the number of weights that need to be updated in each learning iteration. What this means is that, in effect, the last softmax layer is replaced by a single logistic regression (sigmoid) unit which simply predicts whether a training sample was created from the empirical distribution or from the noise distribution. For each context word, only k + 1 binary classification problems need to be solved as part of the training (see the sketch after this list).
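As a rough sketch of step 4 (the vocabulary size, context id and noise distribution below are made-up placeholders), this is how one positive pair and k negative pairs could be assembled for a single context entity:

import numpy as np

rng = np.random.default_rng(0)

VOCAB_SIZE = 10   # |V|, kept tiny for illustration
k = 5             # number of negative samples per positive pair

# One observed (context, entity) pair from the empirical distribution -> label 1.
c_i, e_i = 3, 7
samples = [(c_i, e_i, 1)]

# Noise distribution q over the vocabulary; uniform here for simplicity.
q = np.full(VOCAB_SIZE, 1.0 / VOCAB_SIZE)

# k entities drawn from q, paired with the same context -> label 0.
for e_j in rng.choice(VOCAB_SIZE, size=k, p=q):
    samples.append((c_i, int(e_j), 0))

# k + 1 binary classification examples instead of one |V|-way softmax.
print(samples)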

2.1 Mathematical foundation

In this section we formulate the loss function for NCE and derive the values of its constituents.

Formulating the loss function

As discussed, NCE converts a multi-class classification problem into a small number of binary classification problems. Its loss function can therefore be defined as:

\(L_{nce_k} =\) log likelihood of positive training samples + log likelihood of negative training samples

For NCE, for any given context \(c_i\), we take one positive training sample from the empirical distribution, so that \((c_i, e_i )\) is labeled 1, and k negative samples \((c_i, e_j )\), \(j \in \{1,2,...,k\}\), with each \(e_j\) sampled from q and labeled 0.

Let \(L_{neg}\) be the log likelihood of the negative training samples:
\(\displaystyle L_{neg} = \sum_{j=1}^k log \ p( D = 0 | e_j,c_i )\)

Multiplying and dividing by k:
\(\displaystyle \displaystyle L_{neg} = k * \frac{1}{k} \sum_{j=1}^k log \ p( D = 0 | e_j,c_i )\)

Since the \(e_j\) are drawn i.i.d. from the noise distribution q, the sample average \(\frac{1}{k} \sum_{j=1}^k log \ p( D = 0 \mid e_j,c_i )\) is a Monte Carlo estimate of the expectation \(E_{e_j \sim q} \ log \ p( D = 0 \mid e_j,c_i )\). In expectation, therefore:
\(\displaystyle L_{neg} = k * E_{e_j \sim q} log \ p( D = 0 \mid e_j,c_i )\)

Summing the log likelihoods of positive and negative training samples:
\(L_{nce_k} = log \ p( D = 1 \mid e_i,c_i ) + k * E_{e_j \sim q} log \ p( D = 0 \mid e_j,c_i )\)

\(E_{e_j \sim q} log \ p( D = 0 \mid e_j,c_i )\) is approximated using a Monte Carlo approximation:
\(\displaystyle L_{nce_k}^{MC} = log \ p( D = 1 \mid e_i,c_i ) + k * \frac{1}{k} \sum_{j=1, e_j \sim q }^k log \ p( D = 0 \mid e_j,c_i )\)
\(\displaystyle L_{nce_k}^{MC} = log \ p( D = 1 \mid e_i,c_i ) + \sum_{j=1, e_j \sim q }^k log \ p( D = 0 \mid e_j,c_i )\) … (1)
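Equation (1) is the objective for a single context \(c_i\); during training it is typically summed (and maximized) over all observed (context, entity) pairs. Writing \(e_{ij}\) for the j-th noise sample drawn for context \(c_i\) (this indexing is ours, just to make the sum explicit):

\[\displaystyle L_{nce_k}^{MC}(\theta) = \sum_{i=1}^{m} \left( log \ p( D = 1 \mid e_i,c_i ) + \sum_{j=1}^{k} log \ p( D = 0 \mid e_{ij},c_i ) \right)\]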

Finding \(p( D = 1 \mid e_i,c_i )\) and \(p( D = 0 \mid e_j,c_i )\)

Owing to how the positive and negative samples are created (for each context we keep one sample from the data and draw k samples from the noise distribution, so a randomly picked sample is a noise sample with probability \(\frac{k}{k+1}\) and a data sample with probability \(\frac{1}{k+1}\)), we have:
\(p(d ,e' \mid c) = \frac{k}{k+1} * q (e')\), for d = 0 … ( 2 )
\(p(d ,e \mid c) = \frac{1}{k+1} * \hat{p} (e \mid c )\), for d = 1 … ( 3 )

By definition of conditional probability:
\(\displaystyle p(d ,e \mid c) = \frac { p(d, e, c)}{ p(c)}\)

Multiplying and dividing by p(e, c):
\(\displaystyle p(d ,e \mid c) = \frac { p(d, e, c) * p(e,c)}{ p(c) * p(e,c)}\)
\(\displaystyle p(d ,e \mid c) = \frac{\frac { p(d, e, c)}{p(e,c)} }{\frac{ p(c)}{p(e,c)} }\)
\(\displaystyle p(d ,e \mid c) = p(d \mid e, c) * p(e \mid c)\)
\(\displaystyle p(d \mid e, c) = \frac{p(d, e \mid c)}{ p(e \mid c) }\)

The denominator on the RHS is the numerator marginalized over d: \(\displaystyle p(e \mid c) = \sum_{d \in \{0,1\}} p(d, e \mid c) = \frac{1}{k+1} * \hat{p}(e \mid c) + \frac{k}{k+1} * q(e)\).

Substituting equations (2) and (3) into the formula for \(p(d \mid e, c)\), and replacing \(\hat{p} (e \mid c )\) with \(s_{\theta}(e,c)\), where \(s_{\theta}(e,c)\) is the non-negative score obtained from the model (NCE treats the model's normalizing constant as a fixed value, often 1, which is what lets it avoid the partition function), we have, after the \(\frac{1}{k+1}\) factors cancel:

\[\displaystyle p(d = 0 \mid e', c) = \frac{k * q(e')}{s_{\theta}(e',c) + k * q(e') }\] \[\displaystyle p(d = 1 \mid e, c) = \frac{s_{\theta}(e,c)}{s_{\theta}(e,c) + k * q(e) }\]

These values can be plugged into equation (1) to obtain the value of the loss function \(L_{nce_k}^{MC}\).
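As a minimal sketch (all scores and noise probabilities below are made-up numbers), the snippet plugs these two conditional probabilities into equation (1) for a single context, given the model scores \(s_{\theta}(e,c)\) and the noise probabilities q(e):

import numpy as np

k = 5

# Hypothetical non-negative model scores s_theta(e, c).
s_pos = 2.3                                    # score of the observed (positive) entity
s_neg = np.array([0.4, 1.1, 0.2, 0.9, 0.6])    # scores of the k noise entities

# Noise probabilities q(e) for the same entities.
q_pos = 0.01
q_neg = np.array([0.05, 0.02, 0.08, 0.01, 0.04])

# p(D = 1 | e, c) = s / (s + k*q)   and   p(D = 0 | e', c) = k*q / (s + k*q)
p_pos = s_pos / (s_pos + k * q_pos)
p_neg = (k * q_neg) / (s_neg + k * q_neg)

# Monte Carlo NCE objective of equation (1) for this context (to be maximized).
L_nce = np.log(p_pos) + np.sum(np.log(p_neg))
print("per-context NCE objective:", L_nce)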

3. Negative Sampling

Negative sampling is a variation of NCE that defines the conditional probabilities as:

\[\displaystyle p(d = 0 \mid e', c) = \frac{1}{s_{\theta}(e',c) + 1}\] \[\displaystyle p(d = 1 \mid e, c) = \frac{s_{\theta}(e,c)}{s_{\theta}(e,c) + 1 }\]

This objective can be viewed as NCE where \(k = \mid V \mid\) and q is uniform (i.e. \(q(e) = \frac{1}{ \mid V \mid }\)), making \(k * q(e) = 1\). Hence, for values of \(k < \mid V \mid\), negative sampling may still work well but does not provide the asymptotic consistency guarantees that NCE has. In practice, a small value of k (roughly 2 to 5) is chosen for large data sets, while a relatively larger value (around 5 to 20) is preferred for smaller data sets.
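If we assume the usual word-embedding parameterization \(s_{\theta}(e,c) = e^{v_e \cdot v_c}\) (an assumption about the score function, not something derived above), then \(\frac{s_{\theta}(e,c)}{s_{\theta}(e,c)+1}\) is exactly the sigmoid of the dot product of the two embedding vectors, which is why negative sampling is commonly implemented as binary logistic regression on embedding dot products:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical embedding vectors for a context word and a candidate word.
v_c = np.array([0.2, -0.1, 0.4])
v_e = np.array([0.3,  0.5, -0.2])

logit = v_c @ v_e        # dot product of the two embeddings
s = np.exp(logit)        # s_theta(e, c) = exp(v_e . v_c)

p_d1 = s / (s + 1.0)                      # negative-sampling form of p(d = 1 | e, c)
print(np.isclose(p_d1, sigmoid(logit)))   # True: identical to a sigmoid unit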

4. Example: Word2Vec model

It is easy to get lost in all the math, so in this section we explore how a Keras model can be built to implement a Word2Vec model that learns word embeddings using negative sampling.

from tensorflow import keras
from tensorflow.keras import layers

N_WORDS = 1000       # vocabulary size |V| (placeholder value)
EMBEDDING_DIM = 32   # dimensionality of the learned word vectors (placeholder value)

# Maps each (context_id, target_id) index pair to a pair of embedding vectors.
embedding_layer = layers.Embedding(N_WORDS, EMBEDDING_DIM,
                                   embeddings_initializer="RandomNormal",
                                   input_shape=(2,))

model = keras.Sequential([
  embedding_layer,                        # (batch, 2) -> (batch, 2, EMBEDDING_DIM)
  layers.GlobalAveragePooling1D(),        # average the two embeddings -> (batch, EMBEDDING_DIM)
  layers.Dense(1, activation='sigmoid'),  # outputs p(D = 1 | context, target)
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',   # matches the binary NCE-style objective
              metrics=['accuracy'])

In the code sample above:

  • The embedding layer is provided integer index pairs (equivalent to one-hot encodings) for one positive sample of a context word and target word, \(( c_i, w_i )\), and k negative samples \(( c_i, w_j )\), j = 1 … k.
  • GlobalAveragePooling1D reshapes the output of the embedding layer by averaging the context-word and target-word embeddings into a single vector.
  • The last Dense layer outputs \(P(D = 1 \mid c_i, w_j )\) after the sigmoid activation.
  • The total binary cross-entropy loss for the k+1 samples provided in training for each context word \(c_i\) can be seen as the same as the loss function derived in equation (1) in the previous section (up to sign, since Keras minimizes the negative log likelihood).
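To make the sketch concrete, here is one possible way to generate the labeled (context, target) pairs and train the model defined above. It assumes the `model` and the placeholder constants from the code sample, and uses `tf.keras.preprocessing.sequence.skipgrams`, which emits positive pairs from a sliding window together with negatives drawn from the vocabulary (real pipelines often use a frequency-based sampling table instead):

import numpy as np
import tensorflow as tf

# A toy integer-encoded corpus; in practice this comes from a tokenizer.
token_ids = np.random.randint(1, N_WORDS, size=500).tolist()

# Positive (context, target) pairs from a window of +/- 2 words, plus
# negative pairs sampled from the vocabulary; negative_samples=5.0 gives k = 5.
pairs, labels = tf.keras.preprocessing.sequence.skipgrams(
    token_ids, vocabulary_size=N_WORDS, window_size=2, negative_samples=5.0)

x = np.array(pairs)    # shape (num_samples, 2): [context_id, target_id]
y = np.array(labels)   # 1 for observed pairs, 0 for noise pairs

model.fit(x, y, batch_size=256, epochs=5)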


Written on January 15, 2022