This module implements the concept of Dictionary – a mapping between words and their integer ids.
Dictionaries can be created from a corpus and can later be pruned according to document frequency (removing (un)common words via the Dictionary.filter_extremes() method), saved to/loaded from disk (via the Dictionary.save() and Dictionary.load() methods), merged with other dictionaries (Dictionary.merge_with()), etc.
Dictionary encapsulates the mapping between normalized words and their integer ids.
The main function is doc2bow, which converts a collection of words to its bag-of-words representation: a list of (word_id, word_frequency) 2-tuples.
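For example, a minimal sketch (the token id shown is illustrative and depends on the dictionary's contents):

>>> from gensim.corpora import Dictionary
>>> dct = Dictionary(["human computer interaction".split()])
>>> dct.doc2bow("human human interface".split())  # "interface" is unknown here, so it is ignored
[(1, 2)]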
Build a dictionary from a collection of documents. Each document is a list of tokens, i.e. tokenized and normalized utf-8 encoded strings.
This is only a convenience wrapper for calling doc2bow on each document with allow_update=True.
>>> print(Dictionary(["máma mele maso".split(), "ema má máma".split()]))
Dictionary(5 unique tokens)
Assign new word ids to all words.
This is done to make the ids more compact, e.g. after some tokens have been removed via filter_tokens() and there are gaps in the id series. Calling this method will remove the gaps.
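A minimal sketch of the effect, assuming a dictionary from which the token with id 1 has just been removed:

>>> sorted(dct.token2id.values())  # a gap where id 1 used to be
[0, 2]
>>> dct.compactify()
>>> sorted(dct.token2id.values())  # ids form a contiguous range again
[0, 1]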
Convert a document (a list of words) into the bag-of-words format: a list of (token_id, token_count) 2-tuples. Each word is assumed to be a tokenized and normalized utf-8 encoded string. No further preprocessing is done on the words in the document; apply tokenization, stemming, etc. before calling this method.
If allow_update is set, also update the dictionary in the process: create ids for new words. At the same time, update document frequencies – for each word appearing in this document, increase its document frequency (self.dfs) by one.
If allow_update is not set, this function does not modify the dictionary in any way (it is read-only with respect to self).
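For example (a sketch; treat the exact ids as illustrative):

>>> dct = Dictionary()
>>> dct.doc2bow("the quick fox".split(), allow_update=True)  # three new ids are created
[(0, 1), (1, 1), (2, 1)]
>>> dct.doc2bow("the lazy dog".split())  # read-only: "lazy" and "dog" are silently ignored
[(2, 1)]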
Filter out tokens that appear in
1. fewer than no_below documents (absolute number), or
2. more than no_above documents (fraction of the total corpus size, not an absolute number),
and then keep only the first keep_n most frequent tokens that remain (or keep all if keep_n is None).
After the pruning, shrink resulting gaps in word ids.
Note: Due to the gap shrinking, the same word may have a different word id before and after the call to this function!
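For example, a typical pruning call (the parameter values shown are the method's defaults):

>>> # drop tokens in fewer than 5 documents or in more than 50% of all documents,
>>> # then keep at most the 100000 most frequent tokens that remain
>>> dct.filter_extremes(no_below=5, no_above=0.5, keep_n=100000)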
Remove the selected bad_ids tokens from all dictionary mappings, or keep only the selected good_ids in the mappings and remove the rest.
bad_ids is a collection of word ids to be removed; good_ids is a collection of word ids to be kept.
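For example, a sketch that drops every token appearing in only a single document, using the self.dfs document frequencies described under doc2bow:

>>> once_ids = [tokenid for tokenid, docfreq in dct.dfs.items() if docfreq == 1]
>>> dct.filter_tokens(bad_ids=once_ids)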
Create a Dictionary from an existing corpus. This can be useful if you only have a term-document BOW matrix (represented by corpus), but not the original text corpus.
This will scan the term-document count matrix for all word ids that appear in it, then construct and return a Dictionary that maps each word_id -> str(word_id).
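A minimal sketch (the corpus here is a hand-written two-document BOW matrix):

>>> corpus = [[(0, 1), (1, 2)], [(1, 1), (2, 3)]]
>>> dct = Dictionary.from_corpus(corpus)
>>> dct[2]  # words are placeholder strings derived from the ids
'2'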
Return a list of all token ids.
Load a previously saved object from file (also see save).
If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. Default: don't use mmap, load large arrays as normal objects.
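For example (the file path is illustrative):

>>> dct.save('/tmp/deerwester.dict')
>>> loaded = Dictionary.load('/tmp/deerwester.dict', mmap='r')  # mmap any large arrays stored separately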
Load a previously stored Dictionary from a text file. Mirror function to save_as_text.
Merge another dictionary into this dictionary, mapping same tokens to the same ids and new tokens to new ids. The purpose is to merge two corpora created using two different dictionaries, one from self and one from other.
other can be any id=>word mapping (a dict, a Dictionary object, ...).
Return a transformation object which, when accessed as result[doc_from_other_corpus], will convert documents from a corpus built using the other dictionary into a document using the new, merged dictionary (see gensim.interfaces.TransformationABC).
Example:
>>> dict1 = Dictionary(some_documents)
>>> dict2 = Dictionary(other_documents) # ids not compatible with dict1!
>>> dict2_to_dict1 = dict1.merge_with(dict2)
>>> # now we can merge corpora from the two incompatible dictionaries into one
>>> merged_corpus = itertools.chain(some_corpus_from_dict1, dict2_to_dict1[some_corpus_from_dict2])
Save the object to file (also see load).
If separately is None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap'ing large arrays back on load efficiently.
You can also set separately manually, in which case it must be a list of attribute names to be stored in separate files. The automatic check is not performed in this case.
ignore is a set of attribute names not to serialize (file handles, caches, etc.). On subsequent load() calls, these attributes will be set to None.
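For example (a sketch; the attribute name passed to ignore is hypothetical):

>>> dct.save('/tmp/dict.pkl')  # separately=None by default, so large arrays are auto-detected
>>> dct.save('/tmp/dict.pkl', ignore=frozenset(['my_cache']))  # 'my_cache' would load back as None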
Save this Dictionary to a text file, in format: id[TAB]word_utf8[TAB]document frequency[NEWLINE].
Note: use save/load to store in binary format instead (pickle).
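For example, a round trip through the text format (paths illustrative):

>>> dct.save_as_text('/tmp/dict.txt')
>>> loaded = Dictionary.load_from_text('/tmp/dict.txt')  # the mirror function mentioned above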