Early word-embedding algorithms were relatively complex, but over time they became simpler and simpler.
Neural language model: the embedding of each context word j is obtained by multiplying the embedding matrix E by its one-hot vector, e<sub>j</sub> = np.dot(E, o<sub>j</sub>). The context embeddings are concatenated and fed into a hidden layer with parameters W<sub>1</sub> and b<sub>1</sub>, while the softmax layer has parameters W<sub>2</sub> and b<sub>2</sub>. Training learns the E matrix together with the layer parameters; we maximize the likelihood of predicting the next word given the context (the previous words). In the last example we took a window of 6 words that come before the word we want to predict. There are other choices of context when we are trying to learn word embeddings, such as a few words on both the left and the right of the target word, only the last word, or a single nearby word.
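A minimal NumPy sketch of this forward pass might look like the following. The parameter names E, o<sub>j</sub>, W<sub>1</sub>, b<sub>1</sub>, W<sub>2</sub>, b<sub>2</sub> follow the notation above; the vocabulary size, hidden-layer size, and tanh activation are assumptions made just for illustration.

```python
import numpy as np

vocab_size, embed_dim, window = 10000, 300, 6   # assumed vocabulary size; 300-d embeddings, 6-word window
hidden_dim = 128                                # assumed hidden-layer size

# Parameters learned jointly: the embedding matrix E plus the layer parameters.
E  = np.random.randn(embed_dim, vocab_size) * 0.01
W1 = np.random.randn(hidden_dim, embed_dim * window) * 0.01
b1 = np.zeros(hidden_dim)
W2 = np.random.randn(vocab_size, hidden_dim) * 0.01
b2 = np.zeros(vocab_size)

def one_hot(j):
    o = np.zeros(vocab_size)
    o[j] = 1.0
    return o

def predict_next_word(context_ids):
    """Return a probability distribution over the next word, given 6 previous word ids."""
    # e_j = np.dot(E, o_j): look up the embedding of each context word.
    embeddings = [np.dot(E, one_hot(j)) for j in context_ids]
    x = np.concatenate(embeddings)      # shape: (embed_dim * window,)
    h = np.tanh(np.dot(W1, x) + b1)     # hidden layer with W1, b1
    logits = np.dot(W2, h) + b2         # softmax layer with W2, b2
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

p = predict_next_word([12, 7, 930, 4, 58, 201])  # six made-up word ids for the context
```

Maximizing the likelihood of the observed next word (cross-entropy loss on this softmax output) is what drives E toward meaningful embeddings.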
Researchers found that if you really want to build a language model, it is natural to use the last few words as the context. But if your main goal is to learn word embeddings, then you can use any of these other contexts, and they will result in very meaningful word embeddings as well.
To summarize, language modeling poses a machine learning problem in which you input a context (such as the last four words) and predict some target word. Posing that problem allows you to learn good word embeddings.
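As a concrete illustration, turning raw text into (context, target) training pairs for this setup could look roughly like the sketch below; the helper name and the example sentence are assumptions for illustration only.

```python
def make_training_pairs(tokens, context_size=4):
    """Slide over the token list and pair each 4-word context with the word that follows it."""
    pairs = []
    for i in range(context_size, len(tokens)):
        context = tokens[i - context_size:i]   # the last four words
        target = tokens[i]                      # the word to predict
        pairs.append((context, target))
    return pairs

sentence = "I want a glass of orange juice to go along with my cereal".split()
for context, target in make_training_pairs(sentence):
    print(context, "->", target)
# e.g. ['I', 'want', 'a', 'glass'] -> of
```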