Ijson is magic

Jul. 23, 2015

I love Python, but it can be a little resource-hungry at times.

For example, allthethings.json is a 12MB json file. How much memory does a Python script take to load it? Let’s profile the memory usage of a script that just loads it.

import json

def load_json(filename):
    with open(filename, 'r') as f:
        return json.load(f)

if __name__ == "__main__":
    load_json('allthethings.json')

Running memory_profiler on the script above:

$ python -m memory_profiler json_demo.py
Filename: json_demo.py

Line #    Mem usage    Increment   Line Contents
================================================
     3   13.430 MiB    0.000 MiB   @profile
     4                             def load_json(filename):
     5   13.434 MiB    0.004 MiB       with open(filename, 'r') as f:
     6   86.773 MiB   73.340 MiB           return json.load(f)

73MB, more then 6 times the size of the original file.

This is not a big problem when loading a 12MB json on my desktop. But we needed to load a 200MB file on a Heroku standard-2x dyno, so we ended up having problems. What can help us in this case is that we only needed a subset of the data.

To use our allthethings.json example, imagine that the data we need is just the ‘shortname’ key for builders. Searching for a way to get just the data we needed without having to load the whole json in memory, I learned about ijson.

According to it’s documentation ijson is an:

Iterative JSON parser with a standard Python iterator interface

That means that instead of loading the whole file into memory and parsing everything at once, it uses iterators to lazily load the data. In that way, when we pass by a key that we don’t need, we can just ignore it and the generated object can be removed from memory.

import ijson

def load_json(filename):
    with open(filename, 'r') as fd:
        parser = ijson.parse(fd)
        ret = {'builders': {}}
        for prefix, event, value in parser:
            if (prefix, event) == ('builders', 'map_key'):
                buildername = value
                ret['builders'][buildername] = {}
            elif prefix.endswith('.shortname'):
                ret['builders'][buildername]['shortname'] = value

        return ret

if __name__ == "__main__":
    load_json('allthethings.json')

Profiling its memory usage:

(venv)$ python -m memory_profiler ijson_demo.py
Filename: ijson_demo.py

Line #    Mem usage    Increment   Line Contents
================================================
     5   15.109 MiB    0.000 MiB   @profile
     6                             def load_json(filename):
     7   15.109 MiB    0.000 MiB       with open(filename, 'r') as fd:
     8   15.109 MiB    0.000 MiB           parser = ijson.parse(fd)
     9   15.113 MiB    0.004 MiB           ret = {'builders': {}}
    10   26.012 MiB   10.898 MiB           for prefix, event, value in parser:
    11   26.012 MiB    0.000 MiB               if (prefix, event) == ('builders', 'map_key'):
    12   25.824 MiB   -0.188 MiB                   buildername = value
    13   25.824 MiB    0.000 MiB                   ret['builders'][buildername] = {}
    14   26.012 MiB    0.188 MiB               elif prefix.endswith('.shortname'):
    15   25.824 MiB   -0.188 MiB                   ret['builders'][buildername]['shortname'] = value
    16                            
    17   26.012 MiB    0.188 MiB           return ret

Now loading the file takes 11MB and the script total memory usage went from 86MB to 26MB. On our real production example, total memory usage went from 4.5GB to 300MB.

Backends

Ijson gave us a magic level of memory-saving, but there was a speed trade-off. Using the yalj2 backend helped to reduce our loss in speed. To use the yajl backend all we had to do was:

import ijson.backends.yajl2 as ijson

Before:

$ time python ijson_demo.py

real    0m2.966s
user    0m2.944s
sys    0m0.020s

After:

$ time python ijson_demo_yajl2.py

real    0m1.626s
user    0m1.613s
sys    0m0.013s

The downside of using the yajl2 backend is that it requires users to have a C library (libyajl2) installed on their systems. Luckily, it is possible to install it on Heroku.

tags : Mozilla