python - PyMongo and n-grams search
I have a collection of documents in MongoDB and use PyMongo to access and insert into the collection. What I want:
In Python, use map-reduce to efficiently query the number of times an n-gram phrase is used across the entire corpus.
I know how to do this for single words, but I'm struggling to extend it to n-grams. I don't want to tokenize with the NLTK library and then run map-reduce; I believe that would take the efficiency out of the solution. Thanks.
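(For reference, a minimal sketch of the single-word counting the question refers to, run through PyMongo's generic command interface with the since-deprecated mapReduce server command; the database, collection, and "text" field names here are assumptions, not from the original post:

from bson.code import Code
from pymongo import MongoClient

client = MongoClient()                 # assumes a local mongod
db = client["corpus"]                  # hypothetical database name

# JavaScript map/reduce pair: emit 1 per word occurrence, sum per word.
mapper = Code(r"""
function () {
    this.text.split(/\s+/).forEach(function (word) {
        emit(word, 1);
    });
}
""")
reducer = Code(r"""
function (key, values) {
    return Array.sum(values);
}
""")

result = db.command({
    "mapReduce": "documents",          # hypothetical source collection
    "map": mapper,
    "reduce": reducer,
    "out": {"inline": 1},
})
for row in result["results"]:
    print(row["_id"], row["value"])
)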
If you want an efficient system, you'll need to break the n-grams down ahead of time and index them. When I wrote a 5-gram experiment (unfortunately the backend is offline now; I had to give up the hardware), I created a map of word => integer id and stored each n-gram's hex id sequence in the document key field of a MongoDB collection (for example, [10, 2] => "a:2"). Then, randomly distributing ~350 million 5-grams across 10 machines running MongoDB offered sub-second query times over the whole data set.
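As a rough illustration of that key scheme (the helper names and the in-memory map are assumptions, not the original code), the encoding might look like:

word_ids = {}

def word_id(word):
    # Assign integer ids in order of first appearance.
    return word_ids.setdefault(word, len(word_ids))

def ngram_key(words):
    # Colon-joined hex of the id sequence, e.g. ids [10, 2] -> "a:2".
    return ":".join(format(word_id(w), "x") for w in words)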
You can use a similar scheme. With a document such as:
{_id: "a:2", seen: [docid1, docid2, ...]}
you'll be able to find where a given n-gram was found.
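A minimal PyMongo sketch of that document scheme, assuming a hypothetical "ngrams" collection and helper names of my own:

from pymongo import MongoClient

client = MongoClient()                    # assumes a local mongod
ngrams = client["corpus"]["ngrams"]       # hypothetical db/collection names

def index_ngram(key, doc_id):
    # Upsert the n-gram document and record the source document id.
    ngrams.update_one({"_id": key},
                      {"$addToSet": {"seen": doc_id}},
                      upsert=True)

def documents_containing(key):
    # List of document ids in which the n-gram was seen.
    doc = ngrams.find_one({"_id": key})
    return doc["seen"] if doc else []

Since _id is always indexed, the lookup by n-gram key needs no extra index.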
Update: actually, a small correction: in the system that went live I ended up using the same scheme, but encoding the n-gram keys in a binary format for space efficiency (~350M is a lot of 5-grams!); otherwise the mechanics were the same.
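One way such a binary encoding could look, assuming fixed 32-bit word ids (the answer doesn't specify the actual format):

import struct

def binary_key(ids):
    # Pack each word id as a big-endian unsigned 32-bit int; PyMongo
    # stores the resulting bytes as BSON binary in the _id field,
    # roughly halving the size of the hex-and-colon encoding.
    return struct.pack(">%dI" % len(ids), *ids)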