Did a quick comparison of some data serialization options for Python. My requirements for the serialization format were the following:
- Input data is typically either a list or a dictionary.
- Interoperability is important; the format must be usable from at least C.
- A human-readable format is desirable but not necessary.
Based on the requirements, I took a look at the following Python packages:
- PyYAML
- python-cjson
- ujson
- u-msgpack-python
- msgpack-python NOTE: Disqualified since I had trouble running it on Windows.
The following Python 2 script was used to test the various packages:

```python
import time
import umsgpack
import yaml
import cjson
import ujson

DATA = [{'val1': 12345, 'val2': [1, 2, 3, 4, 5], 'val3': "12345"} for _ in range(10000)]

def test_serialization(name, encode, decode):
    print name
    print "  Encoding..."
    t_start = time.clock()
    packed = encode(DATA)
    print "    time = %f seconds" % (time.clock() - t_start)
    print "    size = %u kilobytes" % (len(packed) / 1024)
    print "  Decoding..."
    t_start = time.clock()
    unpacked = decode(packed)
    print "    time = %f seconds" % (time.clock() - t_start)
    print "    same = %r" % (DATA == unpacked)

test_serialization("umsgpack", umsgpack.packb, umsgpack.unpackb)
test_serialization("yaml", yaml.dump, yaml.load)
test_serialization("cjson", cjson.encode, cjson.decode)
test_serialization("ujson", ujson.encode, ujson.decode)
```
The results of running this script on my laptop (Intel Core i7-2670QM) were:
```
umsgpack
  Encoding...
    time = 0.390241 seconds
    size = 341 kilobytes
  Decoding...
    time = 0.430256 seconds
    same = True
yaml
  Encoding...
    time = 8.266586 seconds
    size = 527 kilobytes
  Decoding...
    time = 15.943908 seconds
    same = True
cjson
  Encoding...
    time = 0.030977 seconds
    size = 576 kilobytes
  Decoding...
    time = 0.022119 seconds
    same = True
ujson
  Encoding...
    time = 0.013703 seconds
    size = 478 kilobytes
  Decoding...
    time = 0.018000 seconds
    same = True
```
For my particular application, speed is more important than the size of the serialized data. The clear winner for speed is ujson. For size, msgpack comes out ahead of ujson (341 kilobytes versus 478), which makes sense since it is a binary format.
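The size gap comes down to representation: JSON spells out numbers, quotes, and delimiters as text, while a binary format packs values into raw bytes. A rough stdlib-only illustration, using `struct` as a stand-in for a binary encoding (the actual MessagePack wire layout differs):

```python
import json
import struct

# One record from the benchmark data.
record = {'val1': 12345, 'val2': [1, 2, 3, 4, 5], 'val3': "12345"}

# JSON encodes everything as text: digits, quotes, brackets, commas.
as_json = json.dumps(record, separators=(',', ':'))
print(len(as_json))  # 48 characters

# A binary layout can store each small integer in 2 bytes ('H' = uint16)
# instead of up to 5 ASCII digits, and needs no quotes or brackets.
# This is only a stand-in, not the real MessagePack format.
as_binary = struct.pack('<H5H5s', 12345, 1, 2, 3, 4, 5, b'12345')
print(len(as_binary))  # 17 bytes for the same values, before any framing
```

A real binary format adds type tags and length prefixes on top of this, but the per-value savings are the same reason msgpack's output is the smallest of the bunch.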
Overall, I am very impressed by the performance of ujson. Given the ubiquity of JSON for web-based data, it makes sense that ultra-optimized libraries exist for it. While I love YAML as a data format, the performance of the PyYAML library is not suitable for applications requiring fast encoding and decoding.