Memoise All The Things

On python caching strategies

Posted by Katie McLaughlin on November 30, 2015

Until yesterday, I had no idea that memoisation was a thing. Sure, I knew that saving results for later was a good idea, but I didn’t know the concept had a name.

What’s also nice is just how simply it can be implemented in Python[0].

In octohatrack, I have a number of functions that call out to various parts of the GitHub API. I refactored a bit of the underlying code so that these calls take only a URI, and output only a parsed JSON blob.

From there, all that needed to be added was a @memoise decorator, which I declared as follows:

from functools import wraps

cache = {}

def memoise(wrapped):
  @wraps(wrapped)  # preserve the wrapped function's name and docstring
  def wrapper(*args, **kwargs):
    key = args[0]  # the URI is the first positional argument
    if key not in cache:
      cache[key] = wrapped(*args, **kwargs)
    return cache[key]

  return wrapper

Now, as I understand it, this lets a cache dict store the results of any function call made through this decorator. This is extremely useful for long-running processes, as it’s all stored in memory.
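To see the caching in action, here’s a toy example. The fetch function and the calls list are made up for illustration (a real octohatrack function would hit the GitHub API); the decorator is repeated so the snippet runs standalone:

```python
from functools import wraps

cache = {}

def memoise(wrapped):
  @wraps(wrapped)
  def wrapper(*args, **kwargs):
    key = args[0]
    if key not in cache:
      cache[key] = wrapped(*args, **kwargs)
    return cache[key]
  return wrapper

calls = []  # tracks how many times the wrapped function actually runs

@memoise
def fetch(uri):
  # stand-in for a real GitHub API call
  calls.append(uri)
  return {"uri": uri}

fetch("/repos/glasnt/octohatrack")
fetch("/repos/glasnt/octohatrack")
# the second call is served from the cache; fetch's body ran only once
```

Note that the URI doubles as the cache key, so two calls with the same first argument always share one result.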

However, in the case of octohatrack, I run the script once and then it stops. My use case is to have the cache live longer than the memory allocation for the python process.

json.dump to the rescue!

I can’t hear yaml.load without it being in an American accent, but in this particular case, I’m using json.dump and json.load to export/import results from prior API calls without any validation (apart from “Is the cache file valid JSON?”). This is probably not the best idea in a web client, or anywhere the JSON file comes from untrusted sources.

But for the case of an isolated script running on a local file system, I’m sure it’s fine.

So what I do is run json.load on the cache file when the program launches. Simple enough. But on exit, there’s a fancy Python thing you can do:

import atexit
import json

cache_file = "cache_file.json"

def save_cache():
  with open(cache_file, 'w') as f:
    json.dump(cache, f)

# Always run on exit
atexit.register(save_cache)


The save_cache() call will always be run on exiting the program, successfully or not. This means that if, say, GitHub API rate limiting kicks in, the program will still dump what it has to file. Once the rate-limit timeout expires, running the program again won’t have to re-run the calls from the last iteration, since they’ll be stored in the cache file, and thus loaded into memory on start.
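The launch-side half is just as small. A minimal sketch, assuming the same cache_file name, with a guard for the first run (when no cache file exists yet) and for a corrupt cache file:

```python
import json

cache_file = "cache_file.json"

def load_cache():
  try:
    with open(cache_file) as f:
      return json.load(f)
  except FileNotFoundError:
    # first run: no cache yet, start fresh
    return {}
  except ValueError:
    # cache file isn't valid JSON; discard it rather than crash
    return {}

cache = load_cache()
```

Catching ValueError covers invalid JSON, since json.JSONDecodeError is a subclass of it; that’s the extent of the validation mentioned above.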

This should mean that octohatrack can finally handle longer runs, where the number of API calls required to process an entire GitHub repo exceeds the rate limit. Neat!

[0] - This is still an open Pull Request at the time of writing. If you can suggest any improvements, please do!