Blei’s LDA-C format.
Corpus in Blei’s LDA-C format.
The corpus is represented as two files: one describing the documents, and another describing the mapping between words and their ids.
Each document is one line:
N fieldId1:fieldValue1 fieldId2:fieldValue2 ... fieldIdN:fieldValueN
The vocabulary is a file with words, one word per line; word at line K has an implicit id=K.
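The line layout described above can be illustrated with a small, dependency-free sketch. The helper names `parse_ldac_line` and `parse_vocab` are hypothetical and are not part of the library. Reading values as floats and numbering vocabulary lines from 0 are assumptions made for this illustration.

```python
def parse_ldac_line(line):
    """Parse one LDA-C document line into a list of (field_id, value) pairs."""
    parts = line.split()
    if not parts:
        return []
    declared = int(parts[0])  # leading N: number of id:value pairs that follow
    pairs = []
    for token in parts[1:]:
        field_id, value = token.split(":")
        pairs.append((int(field_id), float(value)))  # float values: an assumption
    if len(pairs) != declared:
        raise ValueError(f"expected {declared} pairs, found {len(pairs)}")
    return pairs


def parse_vocab(lines):
    """Map implicit ids to words; ids are assumed to start at 0 (first line)."""
    return {word_id: line.strip() for word_id, line in enumerate(lines)}


doc = parse_ldac_line("3 0:2 4:1 7:5")
vocab = parse_vocab(["apple\n", "banana\n", "cherry\n"])
```

Here `doc` holds three (id, value) pairs, and `vocab[0]` is the word on the first vocabulary line.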
Initialize the corpus from a file.
fname_vocab is the file with vocabulary; if not specified, it defaults to fname.vocab.
Return the document stored at file position offset.
Load a previously saved object from file (also see save).
If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. Default: don’t use mmap, load large arrays as normal objects.
Save a corpus in the LDA-C format.
There are actually two files saved: fname and fname.vocab, where fname.vocab is the vocabulary file.
This function is automatically called by BleiCorpus.serialize; don’t call it directly, call serialize instead.
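To make the two-file layout concrete, here is a minimal sketch that writes a tiny corpus by hand. It is not the library's own writer: `write_ldac` is a hypothetical helper, and the assumption that ids run from 0 to len(id2word) - 1 is made only for this illustration.

```python
import os
import tempfile


def write_ldac(fname, corpus, id2word):
    """Write `corpus` to `fname` and the vocabulary to `fname + '.vocab'`."""
    with open(fname, "w", encoding="utf-8") as fout:
        for doc in corpus:
            # Each line: N followed by N space-separated id:value pairs.
            fields = [str(len(doc))] + [f"{fid}:{val}" for fid, val in doc]
            fout.write(" ".join(fields) + "\n")
    with open(fname + ".vocab", "w", encoding="utf-8") as fout:
        # One word per line; the line position is the word's implicit id.
        for word_id in range(len(id2word)):
            fout.write(id2word[word_id] + "\n")


corpus = [[(0, 2), (1, 1)], [(2, 3)]]
id2word = {0: "apple", 1: "banana", 2: "cherry"}
path = os.path.join(tempfile.mkdtemp(), "corpus.lda-c")
write_ldac(path, corpus, id2word)
```

After this runs, `path` contains one line per document and `path + '.vocab'` contains one word per line.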
Iterate through the document stream corpus, saving the documents to fname and recording the byte offset of each document. Save the resulting index structure to the file index_fname (or to fname.index if index_fname is not set).
This relies on the underlying corpus class serializer providing (in addition to standard iteration):
the save_corpus method, which returns a sequence of byte offsets, one for each saved document,
the docbyoffset(offset) method, which returns a document positioned at offset bytes within the persistent storage (file).
Example:
>>> MmCorpus.serialize('test.mm', corpus)
>>> mm = MmCorpus('test.mm') # `mm` document stream now has random access
>>> print(mm[42]) # retrieve document no. 42, etc.
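The byte-offset index that makes this random access possible can be sketched in plain Python. This is an illustration of the idea only, not the library's implementation; `save_with_offsets` and `doc_by_offset` are hypothetical names, and storing one document per line is an assumption.

```python
import os
import tempfile


def save_with_offsets(fname, docs):
    """Write one document per line; return the byte offset of each document."""
    offsets = []
    with open(fname, "wb") as fout:
        for doc in docs:
            offsets.append(fout.tell())  # position where this document starts
            fout.write(doc.encode("utf-8") + b"\n")
    return offsets


def doc_by_offset(fname, offset):
    """Return the document stored at byte position `offset`."""
    with open(fname, "rb") as fin:
        fin.seek(offset)
        return fin.readline().decode("utf-8").rstrip("\n")


path = os.path.join(tempfile.mkdtemp(), "docs.txt")
index = save_with_offsets(path, ["2 0:1 3:2", "1 5:4", "3 1:1 2:1 9:7"])
third = doc_by_offset(path, index[2])  # jump straight to the third document
```

The file is opened in binary mode so that the recorded positions are true byte offsets, which is what a later seek needs.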