The 0.7.x series of gensim was about improving performance and consolidating the API. The 0.8.x series will be about new features: 0.8.1, the first in the series, brings a document similarity service.
The source code itself has been moved from gensim to its own, dedicated package, named simserver. Get it from PyPI or clone it on Github.
Conceptually, it is a service that lets you:
>>> from simserver import SessionServer
>>> server = SessionServer('/tmp/my_server') # resume server (or create a new one)
>>> server.train(training_corpus, method='lsi') # create a semantic model
>>> server.index(some_documents) # convert plain text to semantic representation and index it
>>> server.find_similar(query) # convert query to semantic representation and compare against index
>>> ...
>>> server.index(more_documents) # add to index: incremental indexing works
>>> server.find_similar(query)
>>> ...
>>> server.delete(ids_to_delete) # incremental deleting also works
>>> server.find_similar(query)
>>> ...
Note
“Semantic” here refers to semantics of the crude, statistical type – Latent Semantic Analysis, Latent Dirichlet Allocation etc. Nothing to do with the semantic web, manual resource tagging or detailed linguistic inference.
It is aimed at digital libraries of (mostly) text documents. More generally, it helps you annotate, organize and navigate documents in a more abstract way, compared to plain keyword search.
The rest of this document serves as a tutorial explaining the features in more detail.
It is assumed you have gensim properly installed. You’ll also need the sqlitedict package that wraps Python’s sqlite3 module in a thread-safe manner:
$ sudo easy_install -U sqlitedict
To test the remote server capabilities, install Pyro4 (Python Remote Objects, at version 4.8 as of this writing):
$ sudo easy_install Pyro4
Note
Don’t forget to initialize logging to see logging messages:
>>> import logging
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
In the case of text documents, the service expects each document in this form:
>>> document = {'id': 'some_unique_string',
>>> 'tokens': ['content', 'of', 'the', 'document', '...'],
>>> 'other_fields_are_allowed_but_ignored': None}
This format was chosen because it coincides with plain JSON and is therefore easy to serialize and send over the wire, in almost any language. All strings involved must be utf8-encoded.
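For illustration, such a document serializes straight through the standard json module (the id and tokens below are made up):

>>> import json
>>> doc = {'id': 'doc_json_example', 'tokens': ['content', 'of', 'the', 'document']}
>>> wire_format = json.dumps(doc)  # what travels over the wire
>>> json.loads(wire_format) == doc  # and back again, losslessly
True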
A corpus is a sequence of documents: anything that supports the for document in corpus: ... iterator protocol will do. Generators are ok. Plain lists are also ok (but consume more memory).
>>> from gensim import utils
>>> texts = ["Human machine interface for lab abc computer applications",
>>> "A survey of user opinion of computer system response time",
>>> "The EPS user interface management system",
>>> "System and human system engineering testing of EPS",
>>> "Relation of user perceived response time to error measurement",
>>> "The generation of random binary unordered trees",
>>> "The intersection graph of paths in trees",
>>> "Graph minors IV Widths of trees and well quasi ordering",
>>> "Graph minors A survey"]
>>> corpus = [{'id': 'doc_%i' % num, 'tokens': utils.simple_preprocess(text)}
>>> for num, text in enumerate(texts)]
Since corpora are allowed to be arbitrarily large, it is recommended that the client split them into smaller chunks before uploading them to the server:
>>> utils.upload_chunked(server, corpus, chunksize=1000) # send 1k docs at a time
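Because a corpus only needs to support iteration, it can equally well be a generator that streams documents from disk instead of holding them all in memory. A minimal sketch, assuming a hypothetical file documents.txt with one document per line:

>>> def stream_corpus(fname):
>>>     """Yield one document dict per line of fname, without loading the whole file."""
>>>     with open(fname) as infile:
>>>         for num, line in enumerate(infile):
>>>             yield {'id': 'doc_%i' % num, 'tokens': utils.simple_preprocess(line)}
>>> utils.upload_chunked(server, stream_corpus('documents.txt'), chunksize=1000)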
If you use the similarity service object (instance of simserver.SessionServer) in your code directly—no remote access—that’s perfectly fine. Using the service remotely, from a different process/machine, is an option, not a necessity.
Document similarity can also act as a long-running service, a daemon process on a separate machine. In that case, I’ll call the service object a server.
But let’s start with a local object. Open your favourite shell and:
>>> from gensim import utils
>>> from simserver import SessionServer
>>> service = SessionServer('/tmp/my_server/') # or wherever
That initialized a new service, located in /tmp/my_server (you need write access rights to that directory).
Note
The service is fully defined by the content of its location directory (“/tmp/my_server/”). If you use an existing location, the service object will resume from the index found there. Also, to “clone” a service, just copy that directory somewhere else. The copy will be a fully working duplicate of the original service.
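For example, cloning boils down to a plain directory copy (the paths here are placeholders; do it while no modification session is in progress):

$ cp -r /tmp/my_server /tmp/my_server_backup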
We can start indexing right away:
>>> service.index(corpus)
AttributeError: must initialize model for /tmp/my_server/b before indexing documents
Oops, we cannot. The service indexes documents in a semantic representation, which is different from the plain text we feed it. We must first teach the service how to convert between plain text and semantics:
>>> service.train(corpus, method='lsi')
That was easy. The method='lsi' parameter means we trained a Latent Semantic Indexing model with the default dimensionality (400) over a tf-idf representation of our little corpus, all automatically. More on that later.
Note that for the semantic model to make sense, it should be trained on a corpus that is reasonably similar to the documents you will later index, and reasonably large (thousands of documents or more), so that the statistical analysis has something to work with. With the model in place, we can index:
>>> service.index(corpus) # index the same documents that we trained on...
Indexing can happen over any documents, but I’m too lazy to create another example corpus, so we index the same 9 docs used for training.
Delete documents with:
>>> service.delete(['doc_5', 'doc_8']) # supply a list of document ids to be removed from the index
When you pass documents that have the same id as some already indexed document, the indexed document is overwritten by the new input (=only the latest counts; document ids are always unique per service):
>>> service.index(corpus[:3]) # overall index size unchanged (just 3 docs overwritten)
The index/delete/overwrite calls can be arbitrarily interspersed with queries. You don’t have to index all documents first to start querying, indexing can be incremental.
There are two types of queries:
by id:
>>> print(service.find_similar('doc_0'))
[('doc_0', 1.0, None), ('doc_2', 0.30426699, None), ('doc_1', 0.25648531, None), ('doc_3', 0.25480536, None)]
>>> print(service.find_similar('doc_5')) # we deleted doc_5 and doc_8, remember?
ValueError: document 'doc_5' not in index
In the resulting 3-tuples, doc_n is the document id we supplied during indexing, 0.30426699 is the similarity of doc_n to the query, but what's up with that None, you ask? Well, you can associate each document with a "payload" during indexing. This payload object (anything pickle-able) is later returned during querying. If you don't specify doc['payload'] during indexing, queries simply return None in the result tuple, as in our example here; see the short payload sketch after the query examples below.
or by document (using document['tokens']; the id is ignored in this case):
>>> doc = {'tokens': utils.simple_preprocess('Graph and minors and humans and trees.')}
>>> print(service.find_similar(doc, min_score=0.4, max_results=50))
[('doc_7', 0.93350589, None), ('doc_3', 0.42718196, None)]
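As for that payload, here is a small sketch of indexing a document together with one; the id, tokens and payload values below are made up for illustration (anything pickle-able works as the payload):

>>> doc = {'id': 'doc_9',
>>>        'tokens': utils.simple_preprocess('Human and computer interaction in graph and tree structures'),
>>>        'payload': {'title': 'Some article', 'url': 'http://example.com/article'}}
>>> service.index([doc])
>>> print(service.find_similar('doc_9', max_results=3))  # the payload now comes back as the third element of the matching tuple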
So far, we did everything in our Python shell, locally. I very much like Pyro, a pure Python package for Remote Procedure Calls (RPC), so I’ll illustrate remote service access via Pyro. Pyro takes care of all the socket listening/request routing/data marshalling/thread spawning, so it saves us a lot of trouble.
To create a similarity server, we just create a simserver.SessionServer object and register it with a Pyro daemon for remote access. There is a small example script included with simserver, run it with:
$ python -m simserver.run_simserver /tmp/testserver
You can just ctrl+c to terminate the server, but leave it running for now.
Now open your Python shell again, in another terminal window or possibly on another machine, and:
>>> import Pyro4
>>> service = Pyro4.Proxy(Pyro4.locateNS().lookup('gensim.testserver'))
Now service is only a proxy object: every call is physically executed wherever you ran the run_simserver script, which can be a totally different computer (within a network broadcast domain), and you don't even have to know:
>>> print(service.status())
>>> service.train(corpus)
>>> service.index(other_corpus)
>>> service.find_similar(query)
>>> ...
It is worth mentioning that Irmen, the author of Pyro, also released Pyrolite recently. It is a package that lets you create Pyro proxies from Java and .NET as well, so you can call the remote methods from those languages too; the client doesn't have to be written in Python.
Ok, now it’s getting interesting. Since we can access the service remotely, what happens if multiple clients create proxies to it at the same time? What if they want to modify the server index at the same time?
Answer: the SessionServer object is thread-safe, so that when each client spawns a request thread via Pyro, they don’t step on each other’s toes.
Moreover, the service uses transactions internally: each modification is done over a clone of the service. If the modification session fails for whatever reason (an exception in the code; a power failure that turns off the server; the client being unhappy with how the session went), it can be rolled back. This also means other clients can continue querying the original index during index updates.
This mechanism is hidden from users by default through auto-committing (it was already happening in the examples above), but it can be turned off explicitly:
>>> service.set_autosession(False)
>>> service.train(corpus)
RuntimeError: must open a session before modifying SessionServer
>>> service.open_session()
>>> service.train(corpus)
>>> service.index(corpus)
>>> service.delete(doc_ids)
>>> ...
None of these changes are visible to other clients yet. Also, other clients' calls to index/train/etc. will block until this session is committed or rolled back; there cannot be two open sessions at the same time.
To end a session:
>>> service.rollback() # discard all changes since open_session()
or:
>>> service.commit() # make changes public; now other clients can see changes/acquire the modification lock
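Putting it together, a manual session is typically wrapped so that any failure rolls the index back to its last committed state; a minimal sketch (corpus being whatever batch of documents you want to add):

>>> service.set_autosession(False)
>>> service.open_session()
>>> try:
>>>     service.index(corpus)  # any number of modifications...
>>>     service.commit()       # ...published atomically to other clients
>>> except Exception:
>>>     service.rollback()     # on any failure, discard the whole session
>>>     raise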