python - Is there another way to avoid duplication of large hashable objects? -


i processing text , have need store large sequences of hashable objects - strings, tuples of words, etc. i've been thinking of using hash function provide simple store , retrieve class first approach possible single hash key might resolve more 1 item. given add function takes return value of add argument cannot know item in list return.

class hashstore:     def __init__(self):         self.uniques = {}      def add(self, big_hashable):         hash_value = hash(big_hashable)         if hash_value not in self.uniques:             self.uniques[hash_value] = [big_hashable]         elif big_hashable not in self.uniques[hash_value]:             self.uniques[hash_value].append(big_hashable)          return hash_value 

another approach ends assuring there single mapping each unique hashable item.

class singlestore:     def __init__(self):         self.uniques = {}         self.indexed = {}         self.index = 0      def add(self, big_hashable):         if big_hashable not in self.uniques:             self.index += 1             self.uniques[big_hashable] = self.index             self.indexed[self.index] = big_hashable          return self.uniques[big_hashable] 

this works , assures return value of add can used return unique value. seems bit clumsy. there better, more pythonic way of handling situation?

i've been ambiguous question. there 2 issues - 1 have millions of objects using keys ranging 100s 1000s of bytes each (the big_hashable thing). converting integers enable processing of more data can. secondly, keeping single canonical copy of each big_hashable thing cut down on memory usage well, though first issue driving question, because each key separate copy of big_hashable thing.

if don't need able efficiently retrieve canonical copy of object given different copy, can use set:

s = set() s.add(3) s.add(3) # s has 1 3 in 

if need able efficiently retrieve canonical copies of objects, don't store them hash value - that'd horribly broken. use hashable directly.

class interner(object):     def __init__(self):         self._store = {}     def canonical_object(self, thing):         """returns canonical object equal thing.          returns same result equal things.          """          return self._store.setdefault(thing, thing) 

with weakref module, can improve not keep canonical object if client code lets go of it, built-in intern function strings.


Comments

Popular posts from this blog

java - Run a .jar on Heroku -

java - Jtable duplicate Rows -

validation - How to pass paramaters like unix into windows batch file -