Ijson is magic
I love Python, but it can be a little resource-hungry at times.
For example, allthethings.json is a 12MB json file. How much memory does a Python script take to load it? Let’s profile the memory usage of a script that just loads it.
Running memory_profiler on the script above:
73MB, more then 6 times the size of the original file.
This is not a big problem when loading a 12MB json on my desktop. But we needed to load a 200MB file on a Heroku standard-2x dyno, so we ended up having problems. What can help us in this case is that we only needed a subset of the data.
To use our allthethings.json example, imagine that the data we need is just the ‘shortname’ key for builders. Searching for a way to get just the data we needed without having to load the whole json in memory, I learned about ijson.
According to it’s documentation ijson is an:
Iterative JSON parser with a standard Python iterator interface
That means that instead of loading the whole file into memory and parsing everything at once, it uses iterators to lazily load the data. In that way, when we pass by a key that we don’t need, we can just ignore it and the generated object can be removed from memory.
Profiling its memory usage:
Now loading the file takes 11MB and the script total memory usage went from 86MB to 26MB. On our real production example, total memory usage went from 4.5GB to 300MB.
Backends
Ijson gave us a magic level of memory-saving, but there was a speed trade-off. Using the yalj2 backend helped to reduce our loss in speed. To use the yajl backend all we had to do was:
Before:
After:
The downside of using the yajl2 backend is that it requires users to have a C library (libyajl2) installed on their systems. Luckily, it is possible to install it on Heroku.
tags : Mozilla