python - Is there another way to avoid duplication of large hashable objects?
I am processing text and have a need to store large sequences of hashable objects - strings, tuples of words, etc. I've been thinking of using a hash function to provide a simple store and retrieve class, but with my first approach it is possible that a single hash key might resolve to more than one item. If I add a function that takes the return value of add as its argument, I cannot know which item in the list to return.
class hashstore:
    def __init__(self):
        self.uniques = {}

    def add(self, big_hashable):
        hash_value = hash(big_hashable)
        if hash_value not in self.uniques:
            self.uniques[hash_value] = [big_hashable]
        elif big_hashable not in self.uniques[hash_value]:
            self.uniques[hash_value].append(big_hashable)
        return hash_value
Another approach ends up assuring that there is only a single mapping for each unique hashable item.
class singlestore:
    def __init__(self):
        self.uniques = {}
        self.indexed = {}
        self.index = 0

    def add(self, big_hashable):
        if big_hashable not in self.uniques:
            self.index += 1
            self.uniques[big_hashable] = self.index
            self.indexed[self.index] = big_hashable
        return self.uniques[big_hashable]
This works and assures that the return value of add can be used to return a unique value. It just seems a bit clumsy. Is there a better, more Pythonic way of handling this situation?

I've been ambiguous about the question. There are two issues. One is that I have millions of objects that are currently using keys ranging from 100s to 1000s of bytes each (the big_hashable thing). Converting them to integers would enable processing of more data than I currently can. Secondly, keeping only a single canonical copy of each big_hashable thing would cut down on memory usage as well, though it is the first issue that is driving my question, because each key is actually a separate copy of the big_hashable thing.
If you don't need to be able to efficiently retrieve a canonical copy of an object given a different copy, you can just use a set:
s = set()
s.add(3)
s.add(3)
# s only has one 3 in it
If you do need to be able to efficiently retrieve canonical copies of objects, don't store them by hash value - that'd be horribly broken. Just use the hashable directly.
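To illustrate why keying by hash value breaks, here is a minimal sketch. CollidingKey is a hypothetical class, not something from the question, whose instances deliberately all share one hash value. Real collisions are rarer, but nothing rules them out.

```python
class CollidingKey:
    """Hypothetical key type: every instance has the same hash value."""

    def __init__(self, text):
        self.text = text

    def __hash__(self):
        return 42  # forced collision, for illustration only

    def __eq__(self, other):
        return isinstance(other, CollidingKey) and self.text == other.text


a = CollidingKey("first large object")
b = CollidingKey("second large object")

# Keyed by hash value: the second store silently replaces the first.
by_hash = {}
by_hash[hash(a)] = a
by_hash[hash(b)] = b

# Keyed by the object itself: the dict falls back to __eq__ when hashes
# collide, so both objects are kept apart.
by_object = {a: a, b: b}
```

After this runs, by_hash holds one entry (b) while by_object holds two, so only the second form can return the right object for a.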
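class interner(object):
    def __init__(self):
        self._store = {}

    def canonical_object(self, thing):
        """Returns a canonical object equal to thing.

        Always returns the same result for equal things.
        """
        # setdefault stores thing only if no equal key exists yet, and
        # returns whichever object is stored under that key.
        return self._store.setdefault(thing, thing)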
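A short usage sketch, restating the class so that it runs on its own. The sample tuples are made up for illustration. Two equal tuples built separately come back as one shared object.

```python
class Interner:
    def __init__(self):
        self._store = {}

    def canonical_object(self, thing):
        # Store thing if no equal key exists, then return the stored object.
        return self._store.setdefault(thing, thing)


interner = Interner()

# Two equal tuples built at runtime are distinct objects in memory.
first = tuple("the quick brown fox".split())
second = tuple("the quick brown fox".split())

canonical_first = interner.canonical_object(first)
canonical_second = interner.canonical_object(second)

# Both calls hand back the first tuple, so the duplicate can be dropped.
shared = canonical_first is canonical_second
```

Client code would then keep only the returned object and let the duplicate be garbage collected.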
With the weakref module, you can improve this so that it does not keep a canonical object alive once the client code lets go of it, just like the built-in intern function does for strings (sys.intern in Python 3).
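A rough sketch of what a weakref-based version might look like. WeakInterner and Phrase are hypothetical names used only for illustration. One caveat: in CPython, plain str and tuple instances cannot be weakly referenced, so this assumes the large objects are instances of an ordinary user-defined class.

```python
import gc
import weakref


class WeakInterner:
    """Sketch: canonical objects are dropped once nothing else uses them."""

    def __init__(self):
        # Keys are held weakly, and each value is a weak reference back to
        # its key, so the store never keeps an object alive by itself.
        self._store = weakref.WeakKeyDictionary()

    def canonical_object(self, thing):
        ref = self._store.get(thing)
        canonical = ref() if ref is not None else None
        if canonical is None:
            self._store[thing] = weakref.ref(thing)
            canonical = thing
        return canonical

    def __len__(self):
        return len(self._store)


class Phrase:
    """Hypothetical hashable wrapper around a tuple of words."""

    def __init__(self, words):
        self.words = tuple(words)

    def __hash__(self):
        return hash(self.words)

    def __eq__(self, other):
        return isinstance(other, Phrase) and self.words == other.words


interner = WeakInterner()
first = interner.canonical_object(Phrase(["large", "hashable"]))
second = interner.canonical_object(Phrase(["large", "hashable"]))
same_object = first is second
count_while_alive = len(interner)

# Once client code lets go of the object, the entry disappears.
del first, second
gc.collect()
count_after_release = len(interner)
```

The trade-off is that every stored object must support weak references, which rules out storing bare strings and tuples directly.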