Friday, 13 September 2013

Optimal algorithm for cache with expire

I need a simple cache structure (in Python, but it doesn't really matter) with some specific requirements:

- Up to several million small objects (100 bytes on average).
- Speed is key (both put and get); I'd expect operation times of a few microseconds.
- Only one thread accesses it, so it can live entirely in memory (no persistence needed).
- Keys are MD5 hashes (if it matters).
- There's an expiration time, global for the cache: every key should be removed from the cache after the expiration time, counted from the time of its first put.
Now, the point is how to implement expiration, as everything else can be done with a simple dictionary. The simplest solution, iterating over all the data regularly and removing expired keys, could lock the whole cache for too long. It could be improved by iterating over only part of the data on each cleanup pass, but that still takes time (or doesn't clean fast enough). Also, removing keys one by one looks like a waste of CPU, as they could be removed in batches; they don't have to be removed right at expiration, since we can afford some extra RAM to keep expired keys a little longer.
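To make the partial-iteration idea concrete, here is a minimal sketch (all names are illustrative, not an existing library): each cleanup() call scans only a bounded batch of keys from a snapshot, so no single call holds the cache for long.

```python
import time


class IncrementalCache:
    """Sketch of batched, incremental expiry (hypothetical names)."""

    def __init__(self, ttl=60.0, batch_size=1000):
        self.ttl = ttl
        self.batch_size = batch_size
        self._data = {}          # key -> (insert_time, value)
        self._scan_keys = []     # snapshot of keys still to be scanned

    def put(self, key, value):
        self._data[key] = (time.monotonic(), value)

    def cleanup(self):
        # Refill the scan queue from a fresh snapshot when it runs out.
        if not self._scan_keys:
            self._scan_keys = list(self._data)
        now = time.monotonic()
        batch, self._scan_keys = (
            self._scan_keys[:self.batch_size],
            self._scan_keys[self.batch_size:],
        )
        for key in batch:
            entry = self._data.get(key)
            if entry is not None and now - entry[0] > self.ttl:
                del self._data[key]  # expired: remove as part of this batch
```

As the text notes, this still spends CPU per key and may lag behind if puts outpace the cleanup rate.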
Checking keys at retrieval time is not enough (although it should be done anyway, so that expired keys are never returned), because many keys may never be retrieved and would then stay forever (or just too long).
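The lazy check described here can be sketched as follows (a minimal illustration, with assumed names): each entry stores its insertion time, and get() deletes and hides anything older than the TTL. Keys that are never retrieved linger, which is exactly the limitation above.

```python
import time


class LazyExpiryCache:
    """Sketch: expiry enforced only at retrieval time."""

    def __init__(self, ttl=60.0):
        self.ttl = ttl
        self._data = {}  # key -> (insert_time, value)

    def put(self, key, value):
        self._data[key] = (time.monotonic(), value)

    def get(self, key, default=None):
        entry = self._data.get(key)
        if entry is None:
            return default
        inserted, value = entry
        if time.monotonic() - inserted > self.ttl:
            del self._data[key]  # expired: drop it and report a miss
            return default
        return value
```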
Most answers to this problem suggest using memcached, but I think that would waste CPU, especially since I keep objects that can be put into a dictionary by reference, whereas with memcached they would have to be (de)serialized.
I have an idea for how to implement this: split the data into time slices, actually keeping several dictionaries. For example, if the expiry time is 60 seconds, we keep (at most) 4 dictionaries, and every 20 seconds we add a new one, where new keys are put, and remove the 4th one, whose keys were all added over 60 seconds ago. This makes cleanup very fast, at the cost of retrieval time, since you need to look a key up in 4 dictionaries instead of one (and RAM usage increases by about 33%).
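The time-slice scheme above might look like this (a sketch with made-up names, not a finished implementation): with TTL T and S dictionaries, a fresh dictionary is rotated in every T/(S-1) seconds and the oldest is dropped whole, so cleanup is a single O(1) dictionary drop. Note that without per-entry timestamps, a key can outlive the TTL by up to one slice; storing (timestamp, value) pairs would allow an exact check at retrieval.

```python
import time


class SlicedCache:
    """Sketch of the rotating time-slice cache (illustrative names)."""

    def __init__(self, ttl=60.0, slices=4):
        self.slices = slices
        self.slice_len = ttl / (slices - 1)  # e.g. 60s / 3 = 20s
        self._dicts = [{}]                   # newest slice first
        self._rotated = time.monotonic()

    def _maybe_rotate(self):
        now = time.monotonic()
        while now - self._rotated >= self.slice_len:
            self._dicts.insert(0, {})        # start a fresh slice
            if len(self._dicts) > self.slices:
                self._dicts.pop()            # drop the expired batch wholesale
            self._rotated += self.slice_len

    def put(self, key, value):
        self._maybe_rotate()
        self._dicts[0][key] = value          # new keys go in the newest slice

    def get(self, key, default=None):
        self._maybe_rotate()
        for d in self._dicts:                # check newest slice first
            if key in d:
                return d[key]
        return default
```

With ttl=60 and slices=4, a key put just after a rotation survives about 80 seconds, and one put just before a rotation survives about 60, matching the "some extra RAM for a little longer" trade-off described earlier.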
So, finally, the question: is there a better solution? Or am I wrong, and would one of the solutions mentioned above (removing keys one by one) be better and faster? I don't want to reinvent the wheel, but I couldn't find a good solution on the net.
