This module implements the “hashing trick” – a mapping between words and their integer ids using a fixed, static mapping. The static mapping has a constant memory footprint, regardless of the number of word-types (features) in your corpus, so it’s suitable for processing extremely large corpora.
The ids are computed as hash(word) % id_range, where hash is a user-configurable function (adler32 by default). Using HashDictionary, new words can be represented immediately, without an extra pass through the corpus to collect all the ids first. This is another advantage: HashDictionary can be used with non-repeatable (once-only) streams of documents.
A disadvantage of HashDictionary is that, unlike a plain Dictionary, several words may map to the same id, causing hash collisions: the word<->id mapping is no longer a bijection.
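For illustration, here is a minimal sketch of that id computation in plain Python (not this module's actual code; the id_range value below is made up):

    import zlib

    def token_to_id(token, id_range=32000):
        # The hashing trick: hash the utf-8 bytes of the token and fold the
        # result into a fixed integer range. No vocabulary needs to be stored.
        return zlib.adler32(token.encode("utf-8")) % id_range

    # The same token always maps to the same id; distinct tokens may collide.
    print(token_to_id("human"), token_to_id("computer"))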
HashDictionary encapsulates the mapping between normalized words and their integer ids.
Unlike Dictionary, a HashDictionary does not need to be built before use: documents can be converted immediately, from an uninitialized HashDictionary, without seeing the rest of the corpus first.
The main function is doc2bow, which converts a collection of words to its bag-of-words representation: a list of (word_id, word_frequency) 2-tuples.
By default, keep track of debug statistics and mappings. If you find yourself running out of memory (or are sure you don’t need the debug info), set debug=False.
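A minimal usage sketch, assuming the class is imported from gensim.corpora (the concrete ids in the output depend on the hash function and id range):

    from gensim.corpora import HashDictionary

    # debug=False disables the supplementary id -> words bookkeeping, saving memory
    dct = HashDictionary(debug=False)

    # No initialization pass over the corpus is needed before converting documents.
    bow = dct.doc2bow("human computer interaction human".split())
    print(bow)  # a list of (token_id, token_count) 2-tuples; the ids depend on the hash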
Build dictionary from a collection of documents. Each document is a list of tokens = tokenized and normalized utf-8 encoded strings.
This is only a convenience wrapper for calling doc2bow on each document with allow_update=True.
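For example (a sketch with a made-up toy corpus):

    from gensim.corpora import HashDictionary

    dct = HashDictionary()
    corpus = [["cat", "say", "meow"], ["dog", "say", "woof"]]  # toy corpus
    dct.add_documents(corpus)  # each document updates the corpus statistics in place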
Convert document (a list of words) into the bag-of-words format = list of (token_id, token_count) 2-tuples. Each word is assumed to be a tokenized and normalized utf-8 encoded string. No further preprocessing is done on the words in document; apply tokenization, stemming etc. before calling this method.
If allow_update or self.allow_update is set, then also update dictionary in the process: update overall corpus statistics and document frequencies. For each id appearing in this document, increase its document frequency (self.dfs) by one.
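A short sketch of updating statistics on the fly (the tokens are made up):

    from gensim.corpora import HashDictionary

    dct = HashDictionary()
    bow = dct.doc2bow(["cat", "say", "meow"], allow_update=True)
    print(bow)      # list of (token_id, token_count) 2-tuples
    print(dct.dfs)  # every id seen in this document now has document frequency 1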
Remove document frequency statistics for tokens that appear in fewer than no_below documents (an absolute number), or in more than no_above documents (a fraction of the total corpus size, not an absolute number); after that, keep only the first keep_n most frequent tokens (or keep all if None).
Note: since HashDictionary’s id range is fixed and doesn’t depend on the number of tokens seen, this doesn’t really “remove” anything. It only clears some supplementary statistics, for easier debugging and a smaller RAM footprint.
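A hedged sketch of a call to this method (the parameter names mirror the thresholds described above; verify them against the signature in your gensim version):

    from gensim.corpora import HashDictionary

    dct = HashDictionary()
    dct.add_documents([["cat", "say", "meow"], ["dog", "say", "woof"]])

    # Drop statistics for tokens in fewer than 1 or more than 50% of documents;
    # the ids themselves stay valid, only the supplementary statistics shrink.
    dct.filter_extremes(no_below=1, no_above=0.5, keep_n=100000)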
Return the value for key if key is in the dictionary, else d; d defaults to None.
Return a list of all token ids.
Load a previously saved object from file (also see save).
If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. Default: don’t use mmap, load large arrays as normal objects.
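For instance (the path is illustrative; the file must have been produced by save beforehand):

    from gensim.corpora import HashDictionary

    # Memory-map any large arrays stored in separate files instead of loading them eagerly.
    dct = HashDictionary.load("/tmp/hash_dict.model", mmap="r")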
Remove the specified key and return the corresponding value. If the key is not found, d is returned if given, otherwise a KeyError is raised.
Remove and return some (key, value) pair as a 2-tuple; raise KeyError if the dictionary is empty.
Calculate id of the given token. Also keep track of what words were mapped to what ids, for debugging reasons.
Save the object to file (also see load).
If separately is None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap’ing large arrays back on load efficiently.
You can also set separately manually, in which case it must be a list of attribute names to be stored in separate files. The automatic check is not performed in this case.
ignore is a set of attribute names not to serialize (file handles, caches, etc.). On subsequent load() these attributes will be set to None.
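For example (the path is illustrative only):

    from gensim.corpora import HashDictionary

    dct = HashDictionary()
    dct.add_documents([["cat", "say", "meow"], ["dog", "say", "woof"]])

    # Large arrays (if any) are detected automatically and stored in separate files.
    dct.save("/tmp/hash_dict.model")
    dct2 = HashDictionary.load("/tmp/hash_dict.model")  # round-trip check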
Save this HashDictionary to a text file, for easier debugging.
The format is: id[TAB]document frequency of this id[TAB]tab-separated set of words in UTF8 that map to this id[NEWLINE].
Note: use save/load to store in binary format instead (pickle).
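For example (the path is illustrative; each line of the resulting file follows the id / document frequency / words layout described above):

    from gensim.corpora import HashDictionary

    dct = HashDictionary()
    dct.add_documents([["cat", "say", "meow"], ["dog", "say", "woof"]])
    dct.save_as_text("/tmp/hash_dict.txt")  # plain-text dump, one id per line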
If E has a .keys() method, this does: for k in E: D[k] = E[k]. If E lacks a .keys() method, this does: for (k, v) in E: D[k] = v. In either case, this is followed by: for k in F: D[k] = F[k].