Objects of this class realize the transformation of a word-document co-occurrence matrix (integers) into a locally/globally weighted matrix (positive floats).
This is done by a log entropy normalization, optionally normalizing the resulting documents to unit length. The following formulas explain how to compute the log entropy weight for term i in document j:
local_weight_{i,j} = log(frequency_{i,j} + 1)
P_{i,j} = frequency_{i,j} / sum_j frequency_{i,j}

                      sum_j P_{i,j} * log(P_{i,j})
global_weight_i = 1 + ----------------------------
                      log(number_of_documents + 1)
final_weight_{i,j} = local_weight_{i,j} * global_weight_i
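To make the weighting concrete, here is a minimal sketch of the formulas above applied to a toy count matrix with NumPy (the matrix and all variable names are illustrative assumptions, not part of this class):

import numpy as np

freq = np.array([[1.0, 0.0, 2.0],    # rows: terms i, columns: documents j
                 [3.0, 3.0, 3.0]])
n_docs = freq.shape[1]

# local_weight_{i,j} = log(frequency_{i,j} + 1)
local_weight = np.log(freq + 1)

# P_{i,j} = frequency_{i,j} / sum_j frequency_{i,j}
p = freq / freq.sum(axis=1, keepdims=True)

# entropy term; 0 * log(0) is treated as 0 where the term is absent
with np.errstate(divide='ignore', invalid='ignore'):
    plogp = np.where(p > 0, p * np.log(p), 0.0)

# global_weight_i = 1 + sum_j P_{i,j} * log(P_{i,j}) / log(number_of_documents + 1)
global_weight = 1 + plogp.sum(axis=1) / np.log(n_docs + 1)

# final_weight_{i,j} = local_weight_{i,j} * global_weight_i
final_weight = local_weight * global_weight[:, np.newaxis]

Running this shows the intended behavior: the second term, spread evenly over all documents, receives a much lower global weight than the first, concentrated one.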
The main methods are:

1. constructor, which calculates the global weighting for all terms in
   a corpus.
2. the [] method, which transforms a simple count representation into the
   log entropy normalized space.
>>> log_ent = LogEntropyModel(corpus)
>>> print(log_ent[some_doc])
>>> log_ent.save('/tmp/foo.log_ent_model')
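Like other gensim transformations, the bracket notation should also accept an entire corpus, wrapping it lazily rather than converting it all at once (shown here as an assumption that this class follows the standard transformation interface):

>>> corpus_log_ent = log_ent[corpus]
>>> for doc in corpus_log_ent:  # documents are transformed on the fly
...     print(doc)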
Model persistence is achieved via its load/save methods.
The normalize parameter dictates whether the resulting vectors will be scaled to unit length.
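For example, assuming normalize is exposed as a constructor keyword argument:

>>> log_ent = LogEntropyModel(corpus, normalize=False)  # keep unnormalized weights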
Initialize internal statistics based on a training corpus. Called automatically from the constructor.
Load a previously saved object from file (also see save).
Save the object to file via pickling (also see load).
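A typical save/load round trip, reusing the path from the example above:

>>> log_ent.save('/tmp/foo.log_ent_model')
>>> log_ent = LogEntropyModel.load('/tmp/foo.log_ent_model')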