I did a quick comparison of some data serialization options for Python. My requirements for the serialization format were the following:

  • Input data is typically either a list or a dictionary.
  • Interoperability is important: the format must be usable from at least C.
  • A human readable format is desirable but not necessary.

Based on the requirements, I took a look at the following Python packages:

  • umsgpack (MessagePack)
  • yaml (PyYAML)
  • cjson (JSON)
  • ujson (JSON)

The following Python script was used to test out the various packages:

import time

import umsgpack
import yaml
import cjson
import ujson

# Test payload: a list of 10,000 small dictionaries.
DATA = [{'val1': 12345, 'val2': [1, 2, 3, 4, 5], 'val3': "12345"} for _ in range(10000)]

def test_serialization(name, encode, decode):
    """Time one encode/decode round trip of DATA and print the results."""
    print name
    print "  Encoding..."
    t_start = time.clock()
    packed = encode(DATA)
    print "    time = %f seconds" % (time.clock() - t_start)
    print "    size = %u kilobytes" % (len(packed) / 1024)
    print "  Decoding..."
    t_start = time.clock()
    unpacked = decode(packed)
    print "    time = %f seconds" % (time.clock() - t_start)
    # Check that the round trip reproduces the original data.
    print "    same = %r" % (DATA == unpacked)

test_serialization("umsgpack", umsgpack.packb, umsgpack.unpackb)
test_serialization("yaml", yaml.dump, yaml.load)
test_serialization("cjson", cjson.encode, cjson.decode)
test_serialization("ujson", ujson.encode, ujson.decode)
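
The script above is written for Python 2 (print statements, time.clock). As a rough sketch of how the same measurement could look on Python 3, the version below uses time.perf_counter, since time.clock was removed in Python 3.8. The measure helper is a hypothetical name, and the standard-library json module is only a stand-in so the example runs without third-party packages; it is not one of the libraries benchmarked here.

```python
import json
import time

# Same payload shape as the script above: a list of 10,000 small dictionaries.
DATA = [{'val1': 12345, 'val2': [1, 2, 3, 4, 5], 'val3': "12345"} for _ in range(10000)]

def measure(encode, decode, data=DATA):
    """Time one encode/decode round trip and return the measurements.

    Hypothetical helper, not part of the original script.
    """
    t_start = time.perf_counter()
    packed = encode(data)
    encode_seconds = time.perf_counter() - t_start

    t_start = time.perf_counter()
    unpacked = decode(packed)
    decode_seconds = time.perf_counter() - t_start

    return {
        'encode_seconds': encode_seconds,
        'decode_seconds': decode_seconds,
        # len() counts characters for str output and bytes for bytes output;
        # the two are equal here because the payload is pure ASCII.
        'size_kilobytes': len(packed) // 1024,
        'same': data == unpacked,
    }

# Stand-in encoder/decoder: the standard-library json module (an assumption,
# chosen only so the sketch is self-contained).
result = measure(json.dumps, json.loads)
print("json (stdlib)")
for key, value in result.items():
    print("    %s = %r" % (key, value))
```

Any other encode/decode pair can be passed to measure in the same way, for example ujson.dumps and ujson.loads if that package is installed.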

The result of running this script on my laptop (Intel Core i7 2670QM) is the following:

umsgpack
  Encoding...
    time = 0.390241 seconds
    size = 341 kilobytes
  Decoding...
    time = 0.430256 seconds
    same = True
yaml
  Encoding...
    time = 8.266586 seconds
    size = 527 kilobytes
  Decoding...
    time = 15.943908 seconds
    same = True
cjson
  Encoding...
    time = 0.030977 seconds
    size = 576 kilobytes
  Decoding...
    time = 0.022119 seconds
    same = True
ujson
  Encoding...
    time = 0.013703 seconds
    size = 478 kilobytes
  Decoding...
    time = 0.018000 seconds
    same = True

For my particular application, speed is more important than the size of the serialized data. The clear winner for speed is ujson. For size, msgpack is better than ujson (341 versus 478 kilobytes), which makes sense since it is a binary format.
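
To make the comparison concrete, the short sketch below computes round-trip times relative to the fastest package. The numbers are copied from the results above, on the assumption that the four result blocks appear in the order the script calls test_serialization (umsgpack, yaml, cjson, ujson). RESULTS and round_trip are names made up for this illustration.

```python
# Timings (seconds) and sizes (kilobytes) copied from the results above.
# Assumption: the result blocks are in the order the script runs the packages.
RESULTS = {
    'umsgpack': {'encode': 0.390241, 'decode': 0.430256, 'size': 341},
    'yaml':     {'encode': 8.266586, 'decode': 15.943908, 'size': 527},
    'cjson':    {'encode': 0.030977, 'decode': 0.022119, 'size': 576},
    'ujson':    {'encode': 0.013703, 'decode': 0.018000, 'size': 478},
}

def round_trip(name):
    """Total encode plus decode time in seconds for one package."""
    r = RESULTS[name]
    return r['encode'] + r['decode']

# Pick the package with the shortest round trip and the smallest output.
fastest = min(RESULTS, key=round_trip)
smallest = min(RESULTS, key=lambda name: RESULTS[name]['size'])

for name in RESULTS:
    ratio = round_trip(name) / round_trip(fastest)
    print("%-8s round trip = %9.6f s (%.1fx the fastest)" % (name, round_trip(name), ratio))
print("fastest  = %s" % fastest)
print("smallest = %s" % smallest)
```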

Overall, I am very impressed by the performance of ujson. Given the ubiquity of JSON for web-based data, it makes sense that ultra-optimized libraries would exist for it. While I love YAML as a data format, the PyYAML library is too slow for applications that require fast encoding and decoding.

