Ijson is magic
I love Python, but it can be a little resource-hungry at times.
For example, allthethings.json is a 12MB json file. How much memory does a Python script take to load it? Let’s profile the memory usage of a script that just loads it.
import json
def load_json(filename):
with open(filename, 'r') as f:
return json.load(f)
if __name__ == "__main__":
load_json('allthethings.json')
Running memory_profiler on the script above:
$ python -m memory_profiler json_demo.py
Filename: json_demo.py
Line # Mem usage Increment Line Contents
================================================
3 13.430 MiB 0.000 MiB @profile
4 def load_json(filename):
5 13.434 MiB 0.004 MiB with open(filename, 'r') as f:
6 86.773 MiB 73.340 MiB return json.load(f)
73MB, more then 6 times the size of the original file.
This is not a big problem when loading a 12MB json on my desktop. But we needed to load a 200MB file on a Heroku standard-2x dyno, so we ended up having problems. What can help us in this case is that we only needed a subset of the data.
To use our allthethings.json example, imagine that the data we need is just the ‘shortname’ key for builders. Searching for a way to get just the data we needed without having to load the whole json in memory, I learned about ijson.
According to it’s documentation ijson is an:
Iterative JSON parser with a standard Python iterator interface
That means that instead of loading the whole file into memory and parsing everything at once, it uses iterators to lazily load the data. In that way, when we pass by a key that we don’t need, we can just ignore it and the generated object can be removed from memory.
import ijson
def load_json(filename):
with open(filename, 'r') as fd:
parser = ijson.parse(fd)
ret = {'builders': {}}
for prefix, event, value in parser:
if (prefix, event) == ('builders', 'map_key'):
buildername = value
ret['builders'][buildername] = {}
elif prefix.endswith('.shortname'):
ret['builders'][buildername]['shortname'] = value
return ret
if __name__ == "__main__":
load_json('allthethings.json')
Profiling its memory usage:
(venv)$ python -m memory_profiler ijson_demo.py
Filename: ijson_demo.py
Line # Mem usage Increment Line Contents
================================================
5 15.109 MiB 0.000 MiB @profile
6 def load_json(filename):
7 15.109 MiB 0.000 MiB with open(filename, 'r') as fd:
8 15.109 MiB 0.000 MiB parser = ijson.parse(fd)
9 15.113 MiB 0.004 MiB ret = {'builders': {}}
10 26.012 MiB 10.898 MiB for prefix, event, value in parser:
11 26.012 MiB 0.000 MiB if (prefix, event) == ('builders', 'map_key'):
12 25.824 MiB -0.188 MiB buildername = value
13 25.824 MiB 0.000 MiB ret['builders'][buildername] = {}
14 26.012 MiB 0.188 MiB elif prefix.endswith('.shortname'):
15 25.824 MiB -0.188 MiB ret['builders'][buildername]['shortname'] = value
16
17 26.012 MiB 0.188 MiB return ret
Now loading the file takes 11MB and the script total memory usage went from 86MB to 26MB. On our real production example, total memory usage went from 4.5GB to 300MB.
Backends
Ijson gave us a magic level of memory-saving, but there was a speed trade-off. Using the yalj2 backend helped to reduce our loss in speed. To use the yajl backend all we had to do was:
import ijson.backends.yajl2 as ijson
Before:
$ time python ijson_demo.py
real 0m2.966s
user 0m2.944s
sys 0m0.020s
After:
$ time python ijson_demo_yajl2.py
real 0m1.626s
user 0m1.613s
sys 0m0.013s
The downside of using the yajl2 backend is that it requires users to have a C library (libyajl2) installed on their systems. Luckily, it is possible to install it on Heroku.
tags : Mozilla