Aapeli Vuorinen

Protocol Buffers vs JSON

Protocol buffers (protobufs, protos, or pb) and JavaScript Object Notation (JSON) are two ways of serialising data.

A serialisation method tells you how to convert an object to bytes, and how to convert those bytes back into an object. This is useful if you want to send an object over the wire to another app, or just save it to a file. Not all objects are serialisable (at least not sensibly), so these libraries only serialise fixed structures such as dictionaries, arrays, primitive fields, and combinations of these. For example, you can’t sensibly send a general function over the internet.
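A quick sketch of that last point, using Python’s standard json module: trying to serialise a dictionary containing a function simply fails.

```python
import json

# Functions have no sensible byte representation, so serialisation fails
try:
    json.dumps({"callback": lambda x: x + 1})
    serialisable = True
except TypeError:
    serialisable = False

print(serialisable)  # False
```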

The code examples here are in Python, but the main points stand without reference to any one particular language.

JSON

JSON is a very rudimentary method of turning structures into human-readable strings. It came out of the JavaScript world and gained huge popularity for being really easy to use and inspect.

There is no single standard that defines JSON in practice: specifications exist, but they’re not universally adopted and adhered to.

It’s a great lowest-common denominator specification: almost any language knows how to serialise and deserialise JSON, and it’s easy to inspect an object and figure out what it contains.

To use JSON in Python, you basically just dump objects to strings and load them back, an incredibly easy thing to do. There is no fixed structure for those objects; you can dump and load almost anything you wish.
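That flexibility cuts both ways, though. As one small sketch of the rough edges: JSON object keys must be strings, so non-string dictionary keys are silently coerced on the way through.

```python
import json

# int keys become string keys after a round trip through JSON
original = {1: "one", 2: "two"}
roundtripped = json.loads(json.dumps(original))

print(roundtripped)  # {'1': 'one', '2': 'two'}
print(roundtripped == original)  # False
```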

Protocol buffers

Protocol buffers are a language-independent data serialisation format that came out of Google; it’s extremely fast and efficient.

It’s actually more than a standard for converting structures to bytes and back, since it also ships with a code generator and runtime library that do all the mechanical work for you.

To use protobufs, you have to first define the message format, then generate serialisation and deserialisation code for those objects. This is a bit more work up front, but the idea is that you know precisely what to expect and what you’ll get. Objects that don’t conform to your specification result in an error at serialisation (or deserialisation). This saves you from having to chase down the service, file, or line where the offending object was created and serialised; you can instead look directly at the message definition.

Google designed protobufs for a number of reasons. Some apply to most users (e.g. keeping track of which messages are sent between services in a version-controlled environment, having a completely fixed specification of what messages mean), while others aren’t that important for the majority of common use cases (e.g. speed of serialisation and deserialisation, size of serialised objects).

Example code for Python

JSON

In JSON, you just pass an object to json.dumps (dump into string), and retrieve it with json.loads (load from string):

# in Python standard library
import json

my_obj_json = {
    "key": "value",
    "number": 42,
    "person": {
        "username": "aapeli",
        "display_name": "Aapeli Vuorinen",
        "location": "New York City"
    }
}

json_serialised = json.dumps(my_obj_json).encode('utf8')

print(json_serialised)
# b'{"key": "value", "number": 42, "person": {"username": "aapeli", "display_name": "Aapeli Vuorinen", "location": "New York City"}}'

my_obj_json_deserialised = json.loads(json_serialised.decode('utf8'))

print(my_obj_json_deserialised)

# {'key': 'value', 'number': 42, 'person': {'username': 'aapeli', 'display_name': 'Aapeli Vuorinen', 'location': 'New York City'}}

print(f"Location: {my_obj_json_deserialised['person']['location']}")
# Location: New York City

Protobuf

With Protobuf, we first have to define our message format, so we’ll take the object from above and turn it into a protobuf message definition:

syntax = "proto3";

message Person {
    // Describes an employee
    string username = 1;
    string display_name = 2;
    string location = 3;
}

message Msg {
    // An example message for explaining protobuf
    string key = 1;
    int64 number = 2;

    Person person = 3;
}

Doing this in Python:

with open("sample.proto", "w") as f:
    f.write("""
    syntax = "proto3";

    message Person {
        // Describes an employee
        string username = 1;
        string display_name = 2;
        string location = 3;
    }

    message Msg {
        // An example message for explaining protobuf
        string key = 1;
        int64 number = 2;

        Person person = 3;
    }
    """)

We then need to compile that proto file into Python code with protoc (the Protobuf compiler):

protoc --python_out=. sample.proto

You can then serialise and deserialise objects in Python:

# generated by protoc, uses the protobuf library from pypi
import sample_pb2

my_obj_pb = sample_pb2.Msg(
    key="value",
    number=42,
    person=sample_pb2.Person(
        username="aapeli",
        display_name="Aapeli Vuorinen",
        location="New York City"
    )
)

pb_serialised = my_obj_pb.SerializeToString()

print(pb_serialised)
# b'\n\x05value\x10*\x1a(\n\x06aapeli\x12\x0fAapeli Vuorinen\x1a\rNew York City'

my_obj_pb_deserialised = sample_pb2.Msg.FromString(pb_serialised)

print(my_obj_pb_deserialised)
# key: "value"
# number: 42
# person {
#   username: "aapeli"
#   display_name: "Aapeli Vuorinen"
#   location: "New York City"
# }

print(f"Location: {my_obj_pb_deserialised.person.location}")
# Location: New York City

Now we’ll move to a comparison of the two serialisation methods:

Discussion on usability

The first thing you’ll notice is that, at first, using Protobuf seems much more difficult and complicated than the good old JSON we’re all used to.

I agree: you have to define your message structure, compile some stuff, import special generated code, and so on. Let’s agree that there’s a bit of a learning curve at the start (but be honest, it’s not super complicated).

But that’s because Protobuf solves more than just the problem of turning structures into bytes: it also solves the problem of deciding how to ship data around the network. Protobufs give you a single source of truth for what your message format is, helping you keep track of your message definitions and simplifying your life going forward.

Binary format

print(json_serialised)
# b'{"key": "value", "number": 42, "person": {"username": "aapeli", "display_name": "Aapeli Vuorinen", "location": "New York City"}}'

print(pb_serialised)
# b'\n\x05value\x10*\x1a(\n\x06aapeli\x12\x0fAapeli Vuorinen\x1a\rNew York City'

One handy property of JSON is that it’s self-describing: if you look at any message, you know exactly how it translates to an object in Python. Protobuf lacks this property by design: it keeps messages short, something that’s important for some, but not all, use cases.

JSON clearly wins in this aspect: it’s really easy to look at JSON objects and figure out what’s missing, or what’s going on.

With Protobuf, you need the message definition to decode a serialised message.

# write the data to file
with open("sample.bin", "wb") as f:
    f.write(pb_serialised)

Then in sh:

# deserialise protobuf message from file in sh
$ cat sample.bin | protoc --decode=Msg sample.proto
# key: "value"
# number: 42
# person {
#   username: "aapeli"
#   display_name: "Aapeli Vuorinen"
#   location: "New York City"
# }

Another thing is size:

len(json_serialised)
# 128

len(pb_serialised)
# 51

print(f"protobuf is {round(len(json_serialised)/len(pb_serialised), 3)} times smaller than JSON (in this particular case)")
# protobuf is 2.51 times smaller than JSON (in this particular case)
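(As an aside, the default json.dumps output above includes spaces after separators; passing compact separators shrinks it a little, though nowhere near the protobuf size. A sketch, reusing the same object:)

```python
import json

my_obj_json = {
    "key": "value",
    "number": 42,
    "person": {
        "username": "aapeli",
        "display_name": "Aapeli Vuorinen",
        "location": "New York City"
    }
}

# default separators are ", " and ": "; compact ones drop the spaces
default = json.dumps(my_obj_json).encode("utf8")
compact = json.dumps(my_obj_json, separators=(",", ":")).encode("utf8")

print(len(default), len(compact))  # 128 118
```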

Finally, it’s important to note that JSON doesn’t really handle binary data (raw bytes). You need to base64 encode it, or something similar, which is a pain to do and also very space inefficient. Note that in this case most users do care about size, as binary payloads (e.g. images, videos, executables) tend to be fairly large.
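A quick sketch of that overhead: base64 inflates binary data by roughly a third before it even goes into a JSON string.

```python
import base64
import json

raw = bytes(range(256))  # 256 bytes of arbitrary binary data
encoded = base64.b64encode(raw).decode("ascii")
payload = json.dumps({"data": encoded})  # now it can travel inside JSON

print(len(raw), len(encoded))  # 256 344: about a third bigger, plus JSON overhead
```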

Usage in Python

JSON deserialises into Python dictionaries (most JSON objects are dictionaries, after all). This means you either have to standardise on dictionaries as the universal object in your code, or manually do some “extra deserialisation” from the Python dictionary into a proper Python object. It is extremely easy to fall into the first pitfall and start using dictionaries (then, soon, lists of dictionaries) to move data around within Python code.
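That “extra deserialisation” step might look something like this sketch (Person here is a hypothetical hand-written dataclass, not generated code):

```python
import json
from dataclasses import dataclass

@dataclass
class Person:
    username: str
    display_name: str
    location: str

raw = json.loads(
    '{"username": "aapeli", "display_name": "Aapeli Vuorinen", '
    '"location": "New York City"}'
)

# manually lift the untyped dict into a typed object
person = Person(**raw)
print(person.location)  # New York City
```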

Because Python is duck typed, it’s extremely easy to fall into the habit of just passing opaque Python dictionaries from one function to another. (By opaque I mean that you might not have any idea what the object really contains, and there’s no way to check other than debugging, or chasing down the whole program flow to figure out what it might contain.)

As an example, it’s super easy to fall into this trap:

def person_info(msg):
    return f'{msg["person"]["display_name"]} from {msg["person"]["location"]} has username "{msg["person"]["username"]}".'

print(person_info(my_obj_json_deserialised))
# Aapeli Vuorinen from New York City has username "aapeli".

I’m sure none of us would write this if we were being careful, but we’re all lazy, and it’s so easy to just hack together a method that takes the whole message and grabs the information from “person” within it. It’s incredibly easy to fall into this trap.

Protobufs force you to define your message structures, and in doing so encourage you to write clean code. Writing message definitions makes you think about what you want those messages to contain. For example, it’s normally a really bad idea to copy an object verbatim from an external API and pass it around in an internal system, as that implicitly turns the external definition into your internal one.

Finally, protobufs are just nicer to use in Python. They roughly translate to dataclasses, so you can use attribute notation (my_obj.key) instead of dictionary access (my_obj["key"]). (I personally prefer this by far; look at how confusing the person_info function above is, it’s just messy.)

Similarly, protobufs are just fine to pass around as objects in a system, so it’d be OK to write the following:

def person_info(person: sample_pb2.Person):
    return f'{person.display_name} from {person.location} has username "{person.username}".'

print(person_info(my_obj_pb_deserialised.person))
# Aapeli Vuorinen from New York City has username "aapeli".

Note that it’s dead simple to go find sample.proto, look up the Person message definition inside that file, and straight away know what’s being passed around.

Strong typing, being exactly defined

One problem with JSON is that it’s a weird mix between what humans think things are, and what machines think things are.

So for example, in JSON the following is all good:

{"number": 1000000000000000000000000000000000000}

The problem, however, is that that integer is huge. In fact, it doesn’t fit into a 64-bit integer, and it’s not natively representable as a primitive in many programming languages.

In Python it’s fine (Python has arbitrary-precision integers), but if you ever needed to write, say, a high-performance sub-component in some other language, you’d be stuck. Loading such a value in C++, for instance, requires a “big number” implementation or similar.
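A small sketch of the problem: that number overflows a signed 64-bit integer, and even a 64-bit double can’t represent it exactly.

```python
import json

big = 10 ** 36

# Python round-trips it fine, thanks to arbitrary-precision ints
assert json.loads(json.dumps({"number": big}))["number"] == big

print(big <= 2 ** 63 - 1)      # False: doesn't fit in a signed 64-bit int
print(int(float(big)) == big)  # False: a 64-bit double loses precision too
```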

This is but one example of the complications of JSON. The general problem is that there’s no real standard for JSON: there’s technically one, but implementations and programming languages can’t seem to agree on using it, so in reality there are a few dozen different versions of it.

Protocol buffers, on the other hand, are exactly defined. Each allowed primitive type (e.g. string) has a very precise definition (e.g. UTF-8 encoded bytes representing a string literal) that’s adhered to in each implementation. There’s a great document explaining how protobufs are encoded on the wire, which tells you how it all works. (I encourage you to go read it; it’s enlightening.)
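As a taste of that encoding, here’s a sketch hand-decoding the first field of the serialised message from earlier: each field starts with a tag byte that packs the field number and wire type together.

```python
# first bytes of the serialised Msg from earlier: b'\n\x05value...'
data = b"\n\x05value"

tag = data[0]             # 0x0a
field_number = tag >> 3   # 1 -> the "key" field
wire_type = tag & 0x07    # 2 -> length-delimited (strings, bytes, submessages)

length = data[1]          # 5 bytes of payload follow
value = data[2:2 + length].decode("utf8")

print(field_number, wire_type, value)  # 1 2 value
```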

Adoption

JSON is generally the “language of the web”, and has huge adoption. Most web APIs talk JSON.

Protobuf, on the other hand, is a bit newer but is quickly gaining adoption. This is partly because Google uses protobuf almost exclusively for their message serialisation (and a lot more), and they’re pushing it heavily.

There is plenty of documentation and help available online for both, and both have an abundance of tools and utilities to make life easier.

Summary

The main advantages of JSON are that it’s dead simple to see what the message contains, and everyone knows how it works (or can figure it out in 10 min).

The main advantages of Protobufs are that you know exactly what you’ll get, and you know where to look that information up. Message formats are versioned along with the rest of your source code, and you have a dead clear contract on what is pushed around the network or saved to disk. Messages encode and decode directly into something akin to Python dataclasses and are super easy to work with in code.