MongoDB - Updating a large number of records in a collection
I have a collection called timesheet which has a few thousand records now and will grow to about 300 million records within a year. This collection embeds a few fields from another collection called department.
Most of the department records will never be updated; updates happen once or twice a year, and affect less than 1% of the records in the collection. Typically, once a department is created it won't be updated, and if it is updated, that happens soon after creation (when there are not many related records in timesheet yet).
However, if a department is updated after a year, then in the worst-case scenario the timesheet collection could have 300 million records in total, with about 5 million records matching the department being updated. The update query's condition is on an indexed field.
Since such an update is time-consuming and creates locks, I'm wondering whether there is a better way to do it. One option I'm thinking of is to run the update query in batches by adding a condition like updatedDateTime > someDate && updatedDateTime < someDate.
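The batching idea from the question can be sketched as follows. This is only an illustration, not a tested solution: the field names (`updatedDateTime`, `deptId`, `deptName`) and the 30-day batch size are assumptions, and it presumes an index on `updatedDateTime` so each batch stays cheap.

```javascript
// Split a date range into fixed-size batches so that each update touches a
// bounded slice of the collection and holds locks only briefly.
function dateBatches(start, end, batchDays) {
  const batches = [];
  const ms = batchDays * 24 * 60 * 60 * 1000;
  for (let from = start.getTime(); from < end.getTime(); from += ms) {
    batches.push({
      from: new Date(from),
      to: new Date(Math.min(from + ms, end.getTime()))
    });
  }
  return batches;
}

// In the mongo shell, each batch would then be applied roughly like this
// (collection and field names are illustrative):
// dateBatches(ISODate("2013-01-01"), ISODate("2014-01-01"), 30).forEach(function (b) {
//   db.timesheet.update(
//     { deptId: someDeptId,
//       updatedDateTime: { $gte: b.from, $lt: b.to } },
//     { $set: { deptName: newName } },
//     { multi: true });
// });
```

Sleeping briefly between batches also gives the replica set time to catch up on replication.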
Other details:
A single document is 3 or 4 KB in size, and we have a replica set containing 3 replicas.
Is there a better way to do this? What do you think of this kind of design? And what would you think if the numbers were smaller, as below?
1) 100 million total records and 100,000 records matching the update query
2) 10 million total records and 10,000 records matching the update query
3) 1 million total records and 1,000 records matching the update query
Note: the collection names department and timesheet, and their purpose, are fictional, not real collections, but the statistics I have given are true.
Let me give you a couple of hints based on my global knowledge and experience:
Use shorter field names
MongoDB stores the same key names in each document. This repetition increases disk usage, and it can become a performance issue on a huge database like yours.
Pros:
- Smaller documents, so less disk space
- More documents fit in RAM (more caching)
- Indexes on those fields are smaller
Cons:
- Less readable names
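To make the saving concrete, here is an illustrative comparison (the field names are invented for the example). Since key names are stored in every document, the per-document saving is multiplied across hundreds of millions of records:

```javascript
// The same timesheet entry with long vs. short key names. Both carry the
// same data; only the key bytes stored per document differ.
const verbose = { employeeName: "Alice", departmentName: "Engineering", hoursWorked: 8 };
const compact = { eNm: "Alice", dNm: "Engineering", hrs: 8 };

// Rough per-document saving in bytes (the values are identical, so the
// difference comes entirely from the shorter keys):
const saved = JSON.stringify(verbose).length - JSON.stringify(compact).length;
```

At 300 million documents, even a few dozen bytes per document adds up to several gigabytes of disk and cache.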
Optimize on index size
The smaller an index is, the more of it fits in RAM and the fewer index misses occur. Consider the SHA-1 hashes of git commits as an example: a git commit is often represented by just its first 5-6 characters. So store those 5-6 characters instead of the full hash.
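A minimal sketch of that idea (the field values are made up): index a short prefix of the hash rather than the full 40 characters. A 6-character hex prefix still distinguishes 16^6 (about 16.7 million) values, which is enough for most collections; you would widen the prefix if collisions become likely at your scale.

```javascript
// Store and index a short prefix of a SHA-1 hash instead of the full hash,
// the way git abbreviates commit ids, to keep the index small.
function hashPrefix(sha1, len) {
  if (len === undefined) len = 6;
  return sha1.slice(0, len);
}

const full = "2fd4e1c67a2d28fced849ee1bb76e7391b93eb12";
const short = hashPrefix(full); // "2fd4e1"

// In the mongo shell (illustrative names):
// db.timesheet.insert({ commit: hashPrefix(full), hrs: 8 });
// db.timesheet.ensureIndex({ commit: 1 });
```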
Understand the padding factor
Updates that grow a document cause a costly document move: the old document is deleted, the document is written to a new empty location, and the indexes are updated, all of which is expensive.
We need to make sure a document doesn't move when an update happens. Each collection has a padding factor, which tells MongoDB, during a document insert, how much extra space to allocate beyond the actual document size.
You can see a collection's padding factor using:
db.collection.stats().paddingFactor
Add padding manually
If you are pretty sure your documents will start small and grow, updating them after a while will cause multiple document moves, so it is better to add padding to the document up front. Unfortunately, there is no easy way to add padding: you can add a key holding some random bytes while doing the insert, and then delete that key in the next update query.
Finally, if you are sure certain keys will be added to the documents in the future, preallocate those keys with default values, so that later updates don't grow the document and cause document moves.
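Both tricks can be sketched as below. The helper and all field names (`_padding`, `empId`, `notes`) are invented for the illustration; the padding size would be tuned to how much you expect the document to grow.

```javascript
// Add a throwaway filler field at insert time so the document is allocated
// with room to grow, then remove it on the first real update.
function withPadding(doc, bytes) {
  const padded = Object.assign({}, doc);      // leave the caller's object alone
  padded._padding = new Array(bytes + 1).join("x"); // filler string of `bytes` chars
  return padded;
}

// In the mongo shell:
// db.timesheet.insert(withPadding({ empId: 1, hrs: 8 }, 512));
// db.timesheet.update({ empId: 1 },
//   { $set: { notes: "..." }, $unset: { _padding: "" } });
//
// Preallocation variant: insert future keys with default values up front,
// so later $set calls overwrite in place instead of growing the document:
// db.timesheet.insert({ empId: 1, hrs: 8, notes: "", approvedBy: null });
```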
You can find the queries causing document moves with:
db.system.profile.find({ moved: { $exists: true } })
Large number of collections vs. large number of documents in a few collections
Schema design depends on the application's requirements. If you have a huge collection but only query the latest n days of data, you can optionally choose to have a separate collection per period, and old data can safely be archived. This helps make sure caching in RAM is done properly.
Every collection created incurs a cost beyond the creation itself. Each collection has a minimum size of a few KBs plus 8 KB for one index, and every collection has a namespace associated with it; by default we have about 24,000 namespaces. For example, having one collection per user is a bad choice, since it is not scalable: beyond a point MongoDB won't allow us to create new collections or indexes.
Generally, having many collections carries no significant performance penalty. For example, we can choose to have one collection per month if we know we always query based on months.
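A small sketch of the per-month pattern (the naming scheme is an assumption, not a convention MongoDB imposes): derive the collection name from the date, so each query goes straight to one small, well-cached collection.

```javascript
// Build a per-month collection name like "timesheet_2014_01" so queries
// scoped to a month only ever touch that month's collection.
function monthlyCollectionName(base, date) {
  const month = ("0" + (date.getMonth() + 1)).slice(-2); // zero-padded 01..12
  return base + "_" + date.getFullYear() + "_" + month;
}

// In the mongo shell:
// db[monthlyCollectionName("timesheet", new Date())].find({ empId: 1 });
```

Old months can then be archived or dropped wholesale with a single `drop()`, which is far cheaper than a mass delete inside one big collection.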
Denormalization of data
It is recommended to keep all the related data for a query, or for a sequence of queries, in the same disk location. If you need to, duplicate information across different documents. For example, in a blog application, you'll want to store a post's comments within the post document.
Pros:
- Index size is smaller, since the number of index entries is smaller
- Queries are fast, since a single fetch includes all the necessary details
- Document size is comparable to the page size, which means that when we bring this data into RAM, most of the time we are not bringing other data along with the page
- A document move frees a whole page, not a small chunk of a page that may never be reused by later inserts
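The blog-post example above would look roughly like this as a document. The shape is illustrative (field names are invented), but it shows the key point: one read fetches the post and all its comments together.

```javascript
// Denormalized post: comments are embedded in the post document, so a
// single document fetch returns everything needed to render the page.
const post = {
  title: "Schema design in MongoDB",
  body: "...",
  comments: [
    { author: "alice", text: "Nice post", at: new Date("2014-01-01") },
    { author: "bob",   text: "Agreed",    at: new Date("2014-01-02") }
  ]
};

// In the mongo shell:
// db.posts.insert(post);                       // one document, one disk location
// db.posts.find({ title: post.title });        // post + comments in one read
```

The trade-off is that an ever-growing comments array can grow the document and trigger the document moves discussed above, so embedding fits best when the embedded list is bounded.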
Capped collections
Capped collections behave like circular buffers. They are a special type of fixed-size collection that supports very high-speed writes and sequential reads. Being fixed size, once the allocated space is filled, new documents are written by deleting the older ones. Document updates are only allowed if the updated document fits within the original document size (play with padding for more flexibility).
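Creating one is a single shell command; this fragment is a sketch for the mongo shell, with the collection name and 100 MB size chosen only for illustration:

```javascript
// Create a fixed-size (capped) collection; once 100 MB is full, the oldest
// documents are evicted automatically as new ones are inserted.
db.createCollection("timesheetLog", { capped: true, size: 100 * 1024 * 1024 });

// Reads come back in insertion order; $natural: -1 gives newest first.
db.timesheetLog.find().sort({ $natural: -1 }).limit(10);
```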