python - PyMongo and n-grams search
I have a collection of documents in MongoDB and use PyMongo to access and insert into the collection. What I want:
In Python, use map-reduce to efficiently query the number of times an n-gram phrase is used across the entire corpus.
I know how to do this for single words, but I'm struggling to extend it to n-grams. I don't want to tokenize with the NLTK library and then run map-reduce; I believe that would take the efficiency out of the solution. Thanks.
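(For reference, a minimal sketch of the single-word counting the question refers to, run through PyMongo's generic command interface with the since-deprecated mapReduce server command; the database, collection, and "text" field names here are assumptions, not from the original post:

from bson.code import Code
from pymongo import MongoClient

client = MongoClient()                 # assumes a local mongod
db = client["corpus"]                  # hypothetical database name

# JavaScript map/reduce pair: emit 1 per word occurrence, sum per word.
mapper = Code(r"""
function () {
    this.text.split(/\s+/).forEach(function (word) {
        emit(word, 1);
    });
}
""")
reducer = Code(r"""
function (key, values) {
    return Array.sum(values);
}
""")

result = db.command({
    "mapReduce": "documents",          # hypothetical source collection
    "map": mapper,
    "reduce": reducer,
    "out": {"inline": 1},
})
for row in result["results"]:
    print(row["_id"], row["value"])
)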
If you want an efficient system, you'll need to break the n-grams down ahead of time and index them. When I wrote a 5-gram experiment (unfortunately the backend is offline now; I had to give up the hardware), I created a map of word => integer id and stored each n-gram's hex id sequence in the document key field of a MongoDB collection (for example, [10, 2] => "a:2"). Then, randomly distributing ~350 million 5-grams across 10 machines running MongoDB offered sub-second query times over the whole data set.
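As a rough illustration of that key scheme (the helper names and the in-memory map are assumptions, not the original code), the encoding might look like:

word_ids = {}

def word_id(word):
    # Assign integer ids in order of first appearance.
    return word_ids.setdefault(word, len(word_ids))

def ngram_key(words):
    # Colon-joined hex of the id sequence, e.g. ids [10, 2] -> "a:2".
    return ":".join(format(word_id(w), "x") for w in words)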
You can use a similar scheme. With a document such as:
{_id: "a:2", seen: [docid1, docid2, ...]}
you'll be able to find where a given n-gram was found.
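A minimal PyMongo sketch of that document scheme, assuming a hypothetical "ngrams" collection and helper names of my own:

from pymongo import MongoClient

client = MongoClient()                    # assumes a local mongod
ngrams = client["corpus"]["ngrams"]       # hypothetical db/collection names

def index_ngram(key, doc_id):
    # Upsert the n-gram document and record the source document id.
    ngrams.update_one({"_id": key},
                      {"$addToSet": {"seen": doc_id}},
                      upsert=True)

def documents_containing(key):
    # List of document ids in which the n-gram was seen.
    doc = ngrams.find_one({"_id": key})
    return doc["seen"] if doc else []

Since _id is always indexed, the lookup by n-gram key needs no extra index.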
Update: actually, a small correction: in the system that went live I ended up using the same scheme, but encoding the n-gram keys in a binary format for space efficiency (~350M is a lot of 5-grams!); otherwise the mechanics were the same.
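One way such a binary encoding could look, assuming fixed 32-bit word ids (the answer doesn't specify the actual format):

import struct

def binary_key(ids):
    # Pack each word id as a big-endian unsigned 32-bit int; PyMongo
    # stores the resulting bytes as BSON binary in the _id field,
    # roughly halving the size of the hex-and-colon encoding.
    return struct.pack(">%dI" % len(ids), *ids)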