python - Pymongo and n-grams search


I have a collection of documents in MongoDB, and I use pymongo to access and insert into this collection. I want to:

In Python, use map-reduce to efficiently query the number of times an n-gram phrase is used across the entire corpus.

I know how to do this for single words, but I am struggling to extend it to n-grams. I don't want to tokenize using the NLTK library and then run map-reduce; I believe that would take the efficiency out of the solution. Thanks.
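
For context, a minimal sketch of the single-word map-reduce count the question refers to, assuming pymongo 3.x (where Collection.map_reduce is still available; it was removed in pymongo 4) and a hypothetical docs collection whose documents have a text field:

from pymongo import MongoClient
from bson.code import Code

client = MongoClient()
coll = client.mydb.docs  # hypothetical database/collection names

# Emit each word once per occurrence; the reducer sums the counts.
mapper = Code(r"""
    function () {
        this.text.split(/\s+/).forEach(function (word) {
            emit(word, 1);
        });
    }
""")
reducer = Code(r"""
    function (key, values) {
        return Array.sum(values);
    }
""")

out = coll.map_reduce(mapper, reducer, out="word_counts")
for doc in out.find():
    print(doc["_id"], doc["value"])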

If you want an efficient system, you'll need to break down the n-grams ahead of time and index them. When I wrote a 5-gram experiment (unfortunately the backend is offline; I had to give back the hardware), I created a map of word => integer id, and stored each n-gram as a hex id sequence in the document key field of a MongoDB collection (for example, [10, 2] => "a:2"). Then, randomly distributing ~350 million 5-grams across 10 machines running MongoDB offered sub-second query times over the whole data set.
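
A minimal sketch of that key encoding, assuming a hypothetical in-memory word_ids map (in a real system the word => id map would itself be persisted, e.g. in its own collection):

word_ids = {}  # hypothetical word => integer id map

def word_id(word):
    # Assign integer ids in order of first appearance.
    return word_ids.setdefault(word, len(word_ids))

def ngram_key(words):
    # Encode the id sequence as colon-separated hex, e.g. ids [10, 2] => "a:2".
    return ":".join(format(word_id(w), "x") for w in words)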

You can use a similar scheme. With a document such as:

{_id: "a:2", seen: [docid1, docid2, ...]} 

you'll be able to find where a given n-gram occurs.
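
A minimal sketch of indexing and querying under that scheme, reusing the hypothetical ngram_key helper above and an ngrams collection:

def sliding(words, n):
    # Yield every length-n window over the token list.
    for i in range(len(words) - n + 1):
        yield words[i:i + n]

def index_document(ngrams, doc_id, words, n=5):
    # Record that doc_id contains each of its n-grams.
    for gram in sliding(words, n):
        ngrams.update_one(
            {"_id": ngram_key(gram)},
            {"$addToSet": {"seen": doc_id}},
            upsert=True,
        )

def document_frequency(ngrams, words):
    # Number of distinct documents the n-gram appears in.
    doc = ngrams.find_one({"_id": ngram_key(words)})
    return len(doc["seen"]) if doc else 0

Note that the seen list gives document frequency; if you need total occurrence counts, you could maintain a counter field with $inc instead of (or alongside) the docid set.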

Update: actually, a small correction: in the system that went live I ended up using the same scheme, but encoding the n-gram keys in a binary format for space efficiency (~350M is a lot of 5-grams!); otherwise the mechanics were the same.
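
A minimal sketch of such a binary key, assuming 32-bit word ids and using bson.Binary so the packed bytes can serve as _id:

import struct
from bson.binary import Binary

def ngram_key_binary(ids):
    # Pack the word ids as big-endian 32-bit unsigned ints: a 5-gram is
    # always 20 bytes, versus up to ~40 characters of colon-separated hex
    # once the vocabulary grows large.
    return Binary(struct.pack(">%dI" % len(ids), *ids))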

