Module for deep learning via the hierarchical softmax skip-gram model from [1]. The training algorithm was originally ported from the C package https://code.google.com/p/word2vec/ and extended with additional functionality.
Install Cython with `pip install cython` to use the optimized word2vec training routines (70x speedup [2]).
Initialize a model with e.g.:
>>> model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
Persist a model to disk with:
>>> model.save(fname)
>>> model = Word2Vec.load(fname) # you can continue training with the loaded model!
The model can also be instantiated from an existing file on disk in the word2vec C format:
>>> model = Word2Vec.load_word2vec_format('/tmp/vectors.txt', binary=False) # C text format
>>> model = Word2Vec.load_word2vec_format('/tmp/vectors.bin', binary=True) # C binary format
You can perform various syntactic/semantic NLP word tasks with the model. Some of them are already built-in:
>>> model.most_similar(positive=['woman', 'king'], negative=['man'])
[('queen', 0.50882536), ...]
>>> model.doesnt_match("breakfast cereal dinner lunch".split())
'cereal'
>>> model.similarity('woman', 'man')
0.73723527
>>> model['computer'] # raw numpy vector of a word
array([-0.00449447, -0.00310097, 0.02421786, ...], dtype=float32)
and so on.
[1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of the ICLR Workshop, 2013.
[2] Optimizing word2vec in gensim, http://radimrehurek.com/2013/09/word2vec-in-python-part-two-optimizing/
Iterate over sentences from the Brown corpus (part of NLTK data).
Iterate over sentences from the “text8” corpus, unzipped from http://mattmahoney.net/dc/text8.zip .
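For example, assuming the text8 file has been downloaded and unzipped to /tmp/text8 (path hypothetical), the corpus iterator can be passed directly as the sentences argument of Word2Vec (a sketch):
>>> sentences = Text8Corpus('/tmp/text8')  # yields each sentence as a list of utf8 words
>>> model = Word2Vec(sentences, size=200, min_count=5, workers=4)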
A single vocabulary item, used internally for constructing binary trees (incl. both word leaves and inner nodes).
Class for training, using, and evaluating the neural networks described in https://code.google.com/p/word2vec/
The model can be stored/loaded via its save() and load() methods, or stored/loaded in a format compatible with the original word2vec implementation via save_word2vec_format() and load_word2vec_format().
Initialize the model from an iterable of sentences. Each sentence is a list of words (utf8 strings) that will be used for training. See BrownCorpus in this module for an example.
If you don’t supply sentences, the model is left uninitialized; use this if you plan to initialize it in some other way.
size: dimensionality of the feature vectors.
window: maximum distance between the current and the predicted word within a sentence.
alpha: initial learning rate (will linearly drop to zero as training progresses).
seed: seed for the random number generator.
min_count: ignore all words with total frequency lower than this.
workers: use this many worker threads to train the model (faster training on multicore machines).
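For example, the deferred-initialization path mentioned above can be sketched in two explicit steps, using the build_vocab() and train() methods documented below:
>>> model = Word2Vec(size=100, window=5, min_count=5, workers=4)  # no sentences yet
>>> model.build_vocab(sentences)  # scan the corpus once to collect word counts
>>> model.train(sentences)  # train the neural weights (needs a restartable iterable, or a fresh generator)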
Compute accuracy of the model. questions is the path to a file in which each line is a 4-tuple of words, split into sections by ": SECTION NAME" lines. See https://code.google.com/p/word2vec/source/browse/trunk/questions-words.txt for an example.
The accuracy is reported (=printed to log and returned as a list) for each section separately, plus there’s one aggregate summary at the end.
Use restrict_vocab to ignore all questions containing a word that is not among the restrict_vocab most frequent words in the model (default: the top 30,000).
This method corresponds to the compute-accuracy script of the original C word2vec.
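For example, assuming the questions file linked above has been saved locally as 'questions-words.txt' (path hypothetical):
>>> model.accuracy('questions-words.txt')  # prints per-section accuracy to the log and returns the results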
Build vocabulary from a sequence of sentences (can be a once-only generator stream). Each sentence must be a list of utf8 strings.
Create a binary Huffman tree using stored vocabulary word counts. Frequent words will have shorter binary codes. Called internally from build_vocab().
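As an illustration only (not the module’s internal code), the core idea can be sketched with a priority queue: repeatedly merge the two least frequent nodes, so rare words end up deeper in the tree and frequent words get shorter codes.
import heapq
counts = {'the': 100, 'of': 60, 'cat': 5, 'xylophone': 1}  # toy word counts
heap = [(count, word) for word, count in counts.items()]
heapq.heapify(heap)
codelen = dict.fromkeys(counts, 0)           # number of bits on the path from root to leaf
members = {word: [word] for word in counts}  # leaves under each (possibly merged) node
while len(heap) > 1:
    count1, node1 = heapq.heappop(heap)      # the two least frequent nodes...
    count2, node2 = heapq.heappop(heap)
    merged = node1 + '|' + node2             # ...are merged into an inner node
    members[merged] = members[node1] + members[node2]
    for word in members[merged]:
        codelen[word] += 1                   # every leaf below the merge gains one more bit
    heapq.heappush(heap, (count1 + count2, merged))
print(codelen)  # e.g. {'the': 1, 'of': 2, 'cat': 3, 'xylophone': 3}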
Which word from the given list doesn’t go with the others?
Example:
>>> trained_model.doesnt_match("breakfast cereal dinner lunch".split())
'cereal'
Load a previously saved object from file (also see save).
Load the input-hidden weight matrix from the original C word2vec-tool format.
Note that the information loaded is incomplete (the binary tree is missing), so while you can query for word similarity etc., you cannot continue training with a model loaded this way.
Find the top-N most similar words. Positive words contribute positively towards the similarity, negative words negatively.
This method computes cosine similarity between a simple mean of the projection weight vectors of the given words, and corresponds to the word-analogy and distance scripts in the original word2vec implementation.
Example:
>>> trained_model.most_similar(positive=['woman', 'king'], negative=['man'])
[('queen', 0.50882536), ...]
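Conceptually, the call above reduces to cosine similarity against a mean of unit-normalized word vectors; a minimal numpy sketch of that computation, using only the raw vectors accessible via model[word]:
>>> import numpy as np
>>> unit = lambda v: v / np.linalg.norm(v)
>>> query = np.mean([unit(trained_model['woman']), unit(trained_model['king']), -unit(trained_model['man'])], axis=0)
>>> np.dot(unit(trained_model['queen']), unit(query))  # high cosine similarity, cf. the ranking above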
Reset all projection weights to an initial (untrained) state, but keep the existing vocabulary.
Save the object to file via pickling (also see load).
Store the input-hidden weight matrix in the same format used by the original C word2vec-tool, for compatibility.
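For example (a sketch; same paths as in the loading examples above):
>>> model.save_word2vec_format('/tmp/vectors.bin', binary=True)
>>> model = Word2Vec.load_word2vec_format('/tmp/vectors.bin', binary=True)  # round-trip; see the note on load_word2vec_format above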
Compute cosine similarity between two words.
Example:
>>> trained_model.similarity('woman', 'man')
0.73723527
>>> trained_model.similarity('woman', 'woman')
1.0
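The value returned is plain cosine similarity between the two word vectors; a minimal numpy equivalent (a sketch):
>>> import numpy as np
>>> v1, v2 = trained_model['woman'], trained_model['man']
>>> float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))  # same quantity as similarity('woman', 'man')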
Update the model’s neural weights from a sequence of sentences (can be a once-only generator stream). Each sentence must be a list of utf8 strings.
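For example, to continue training from an additional, streamed text file (the file path and tokenization are hypothetical; any iterable of word lists works):
>>> more_sentences = (line.split() for line in open('/tmp/more_text.txt'))  # once-only generator of utf8 word lists
>>> model.train(more_sentences)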