Module for Latent Semantic Analysis (aka Latent Semantic Indexing) in Python.
Implements scalable truncated Singular Value Decomposition in Python. The SVD decomposition can be updated with new observations at any time (online, incremental, memory-efficient training).
This module actually contains several algorithms for the decomposition of large corpora, a combination of which effectively and transparently allows building LSI models for:

* corpora much larger than RAM: only constant memory is needed, independent of the corpus size
* corpora that are streamed: documents are only accessed sequentially, never loaded all at once
* corpora that cannot even be stored temporarily: each document is seen only once and must be processed immediately (one-pass algorithm)
* distributed computing for very large corpora, making use of a cluster of machines
Wall-clock performance on the English Wikipedia (2G corpus positions, 3.2M documents, 100K features, 0.5G non-zero entries in the final TF-IDF matrix), requesting the top 400 LSI factors:
| algorithm | serial | distributed |
|---|---|---|
| one-pass merge algorithm | 5h14m | 1h41m |
| multi-pass stochastic algo (with 2 power iterations) | 5h39m | N/A [1] |
serial = Core 2 Duo MacBook Pro, 2.53 GHz, 4GB RAM, vecLib
distributed = cluster of four logical nodes on three physical machines, each with dual-core Xeon 2.0 GHz, 4GB RAM, ATLAS
[1] The stochastic algo could be distributed too, but most time is already spent reading/decompressing the input from disk in its 4 passes. The extra network traffic due to data distribution across cluster nodes would likely make it slower.
Objects of this class allow building and maintaining a model for Latent Semantic Indexing (also known as Latent Semantic Analysis).
The main methods are:

1. the constructor, which initializes the projection into latent topics space,
2. the [] method, which returns the representation of any input document in the latent space,
3. the add_documents() method, which allows for incrementally updating the model with new documents.
The left singular vectors are stored in lsi.projection.u, singular values in lsi.projection.s. Right singular vectors can be reconstructed from the output of lsi[training_corpus], if needed. See also FAQ [2].
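For illustration, a minimal sketch of accessing these factors, following the recipe in FAQ [2]; it assumes a trained lsi model and the training_corpus it was fit on:

>>> from gensim import matutils
>>> u, s = lsi.projection.u, lsi.projection.s  # left singular vectors and singular values
>>> # one row of V per training document, reconstructed from the projected corpus:
>>> v = matutils.corpus2dense(lsi[training_corpus], len(s)).T / s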
Model persistence is achieved via its load/save methods.
[2] https://github.com/piskvorky/gensim/wiki/Recipes-&-FAQ#q4-how-do-you-output-the-u-s-vt-matrices-of-lsi
num_topics is the number of requested factors (latent dimensions).
After the model has been trained, you can estimate topics for an arbitrary, unseen document, using the topics = self[document] dictionary notation. You can also add new training documents, with self.add_documents, so that training can be stopped and resumed at any time, and the LSI transformation is available at any point.
If you specify a corpus, it will be used to train the model. See the method add_documents for a description of the chunksize and decay parameters.
Turn onepass off to force a multi-pass stochastic algorithm.
power_iters and extra_samples affect the accuracy of the stochastic multi-pass algorithm, which is used either internally (onepass=True) or as the front-end algorithm (onepass=False). Increasing the number of power iterations improves accuracy, but lowers performance. See [3] for some hard numbers.
Turn on distributed to enable distributed computing.
Example:
>>> from gensim.models import LsiModel
>>>
>>> lsi = LsiModel(corpus, num_topics=10)
>>> print(lsi[doc_tfidf]) # project some document into LSI space
>>> lsi.add_documents(corpus2) # update LSI on additional documents
>>> print(lsi[doc_tfidf])
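A sketch of a constructor call that exercises the tuning parameters discussed above; the exact values are illustrative, not recommendations:

>>> lsi = LsiModel(
...     corpus, num_topics=400, chunksize=20000, decay=1.0,
...     onepass=False,      # force the multi-pass stochastic algorithm
...     power_iters=2,      # more power iterations: better accuracy, lower speed
...     extra_samples=100,  # oversampling for the stochastic algorithm
...     distributed=False,
... )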
[3] http://nlp.fi.muni.cz/~xrehurek/nips/rehurek_nips.pdf
Update singular value decomposition to take into account a new corpus of documents.
Training proceeds in chunks of chunksize documents at a time. The chunksize is a tradeoff between increased speed (larger chunks) and a lower memory footprint (smaller chunks). If the distributed mode is on, each chunk is sent to a different worker/computer.
Setting decay < 1.0 causes re-orientation towards new data trends in the input document stream, by giving less emphasis to old observations. This allows LSA to gradually “forget” old observations (documents) and give more preference to new ones.
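For example (a sketch; more_documents stands in for any additional corpus):

>>> lsi.add_documents(more_documents, chunksize=10000, decay=0.9)  # slightly favour the newer documents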
Load a previously saved object from file (also see save).
Large arrays are mmap’ed back as read-only (shared memory).
Print (to log) the most salient words of the first num_topics topics.
Unlike print_topics(), this looks for words that are significant for a particular topic and not for others. This should result in a more human-interpretable description of topics.
Alias for show_topics() which prints the top 5 topics to log.
Save the model to file.
Large internal arrays may be stored into separate files, with fname as prefix.
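A minimal persistence sketch (the file name is illustrative):

>>> lsi.save('/tmp/model.lsi')             # large arrays stored in separate files, with '/tmp/model.lsi' as prefix
>>> lsi = LsiModel.load('/tmp/model.lsi')  # large arrays mmap'ed back as read-only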
Return a specified topic (=left singular vector), 0 <= topicno < self.num_topics, as a string.
Return only the topn words which contribute the most to the direction of the topic (both negative and positive).
>>> lsimodel.print_topic(10, topn=5)
'-0.340 * "category" + 0.298 * "$M$" + 0.183 * "algebra" + -0.174 * "functor" + -0.168 * "operator"'
Show num_topics most significant topics (show all by default). For each topic, show num_words most significant words (10 words by default).
Return the shown topics as a list – a list of strings if formatted is True, or a list of (value, word) 2-tuples if it’s False.
If log is True, also output this result to log.
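For example (a sketch, assuming a trained lsi model):

>>> lsi.show_topics(num_topics=5, num_words=10, formatted=True)   # five topics as formatted strings
>>> lsi.show_topics(num_topics=5, num_words=10, formatted=False)  # the same topics as (weight, word) tuples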
Given singular values s, return how many factors should be kept to avoid storing spurious (tiny, numerically unstable) values.
This will ignore the tail of the spectrum with relative combined mass < min(discard, 1/k).
The returned value is clipped against k (= never return more than k).
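A rough re-implementation sketch of this rule (not necessarily the module's own code; it assumes s is a 1-D numpy array of singular values in decreasing order):

import numpy as np

def clip_spectrum_sketch(s, k, discard=0.001):
    # relative mass of the spectrum left over after keeping each successive factor
    tail_mass = np.abs(1.0 - np.cumsum(s / np.sum(s)))
    # keep one factor past the point where the discarded tail drops below min(discard, 1/k)
    keep = 1 + len(np.where(tail_mass > min(discard, 1.0 / k))[0])
    return min(keep, k)  # never return more than k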
Run truncated Singular Value Decomposition (SVD) on a sparse input.
Return (U, S): the left singular vectors and the singular values of the input data stream corpus [4]. The corpus may be larger than RAM (iterator of vectors).
This may return less than the requested number of top rank factors, in case the input itself is of lower rank. The extra_dims (oversampling) and especially power_iters (power iterations) parameters affect accuracy of the decomposition.
This algorithm uses 2+power_iters passes over the input data. In case you can only afford a single pass, set onepass=True in LsiModel and avoid using this function directly.
The decomposition algorithm is based on Halko, Martinsson & Tropp, "Finding structure with randomness" (2009).
[4] If corpus is a scipy.sparse matrix instead, it is assumed the whole corpus fits into core memory and a different (more efficient) code path is chosen.
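An illustrative sketch of the randomized range-finder decomposition from Halko et al., written for an in-memory matrix; the streamed, chunked variant described above follows the same structure but reads the corpus chunk by chunk so it never has to fit in RAM. All names here are hypothetical, not the module's API:

import numpy as np

def randomized_svd_sketch(a, rank, extra_dims=10, power_iters=2, seed=0):
    # Randomized SVD sketch (Halko, Martinsson & Tropp, 2009) for an (m x n) numpy array `a`.
    rng = np.random.default_rng(seed)
    k = rank + extra_dims                         # oversampling improves accuracy
    y = a @ rng.standard_normal((a.shape[1], k))  # sample the range of `a`
    q, _ = np.linalg.qr(y)                        # orthonormal basis of that sample
    for _ in range(power_iters):                  # power iterations sharpen the spectrum
        q, _ = np.linalg.qr(a.T @ q)
        q, _ = np.linalg.qr(a @ q)
    b = q.T @ a                                   # small (k x n) projection of `a`
    u_hat, s, _vt = np.linalg.svd(b, full_matrices=False)
    u = q @ u_hat                                 # map back to the original space
    return u[:, :rank], s[:rank]                  # (U, S), as returned by the function described above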