keas.pbpersist and keas.pbstate

Remember how I was talking about serializing data in ZODB using Google Protocol Buffers instead of pickles?  Well, Keas Inc. suggested the idea and asked me to work on it.  The first release of a package combination that implements the idea is ready:

  • keas.pbstate 0.1.1 (on PyPI), which helps you write classes that store all state in a Protocol Buffer message.
  • A patch for ZODB (available at packages.willowrise.org) that makes it possible to plug in serializers other than ZODB’s standard pickle format.
  • keas.pbpersist 0.1 (on PyPI), which registers a special serializer with ZODB so that ProtobufState objects get stored without any pickling.

This code is new, but the tests all pass and the coverage report claims that the tests run every line of code at least once.  I hope these packages get used so we can find out their flaws.  I did my best to make keas.pbstate friendly, but there are some areas, object references in particular, where I don’t like the current syntax.  I don’t know yet whether this code is fast; optimizing now would be premature!

I should mention that while working with protobuf, I got the feeling that C++ is the native language and Java and Python are second-class citizens.  I wonder if I’d have the same feeling with other serialization formats.

ZODB + Protobuf Looking Good

Today I drafted a module that mixes ZODB with Protocol Buffers.  It’s looking good!  I was hoping not to use metaclasses, but metaclasses solved a lot of problems, so I’ll keep them unless they cause deeper problems.

The code so far lets you create a class like this:

class Tower(Persistent):
    __metaclass__ = ProtobufState
    protobuf_type = Tower_pb

    def __init__(self):
        self.name = "Rapunzel's"
        self.height = 1000.0
        self.width = 10.0

    def __str__(self):
        return '%s %fx%f %d' % (
            self.name, self.height, self.width, self.build_year)

The only special lines are the second and third.  The second line says the class should use a metaclass that causes all persistent state to be stored in a protobuf message.  The third line says which protobuf class to use.  The protobuf class comes from a generated Python module (not shown).

If you try to store a string in the height attribute, Google’s code raises an error immediately, which is probably a nice new benefit.  Also, this Python code does not have to duplicate the content of the .proto file.  Did you notice the build_year attribute?  Nothing sets it, but since it’s defined in the message schema, it’s possible to read the default value at any time.  Here is the message schema, by the way:

message Tower {
    required string name = 1;
    required float width = 2;
    required float height = 3;
    optional int32 build_year = 4 [default = 1900];
}

The metaclass creates a property for every field defined in the protobuf message schema.  Each property delegates the storage of an attribute to a message field.  The metaclass also adds __getstate__, __setstate__, and __new__ methods to the class; these are used by the pickle module.  Since ZODB uses the pickle module, the serialization of these objects in the database is now primarily a protobuf message.  Yay!  I accomplished all of this with only about 100 lines of code, including comments. 🙂
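To make the mechanism concrete, here is a minimal sketch of the idea, not the actual module.  It is written for Python 3 and makes several assumptions: FakeTowerMessage is a hypothetical stand-in for the generated protobuf class (with JSON standing in for the protobuf wire format), and the names MessageState, message_type, and _field_property are made up for illustration.

```python
import json


class FakeTowerMessage:
    """Hypothetical stand-in for the generated Tower_pb class."""

    # field name -> (accepted types, default value)
    FIELDS = {
        "name": (str, ""),
        "width": ((int, float), 0.0),
        "height": ((int, float), 0.0),
        "build_year": (int, 1900),
    }

    def __init__(self):
        self._values = {}

    def get(self, field):
        # Unset fields fall back to the schema default.
        return self._values.get(field, self.FIELDS[field][1])

    def set(self, field, value):
        # Reject bad values at assignment time.
        if not isinstance(value, self.FIELDS[field][0]):
            raise TypeError("%r is not valid for field %r" % (value, field))
        self._values[field] = value

    def SerializeToString(self):
        # JSON is an assumption here; real messages use the protobuf format.
        return json.dumps(self._values, sort_keys=True).encode("utf-8")

    def ParseFromString(self, data):
        self._values = json.loads(data.decode("utf-8"))


def _field_property(field):
    """Build a property that delegates one attribute to a message field."""
    def getter(self):
        return self._message.get(field)

    def setter(self, value):
        self._message.set(field, value)

    return property(getter, setter)


class MessageState(type):
    """Simplified metaclass: one property per field, plus state hooks."""

    def __new__(mcs, name, bases, namespace):
        message_type = namespace.get("message_type")
        if message_type is not None:
            for field in message_type.FIELDS:
                namespace[field] = _field_property(field)

            def new(cls, *args, **kwargs):
                # Every instance gets its own empty message.
                obj = object.__new__(cls)
                obj._message = message_type()
                return obj

            def getstate(self):
                return self._message.SerializeToString()

            def setstate(self, state):
                self._message.ParseFromString(state)

            namespace["__new__"] = new
            namespace["__getstate__"] = getstate
            namespace["__setstate__"] = setstate
        return super().__new__(mcs, name, bases, namespace)


class Tower(metaclass=MessageState):
    message_type = FakeTowerMessage

    def __init__(self):
        self.name = "Rapunzel's"
        self.height = 1000.0
        self.width = 10.0


tower = Tower()
state = tower.__getstate__()       # serialized message bytes

restored = Tower.__new__(Tower)    # bypass __init__, as unpickling does
restored.__setstate__(state)
```

In this sketch, tower.build_year falls back to the schema default because nothing sets it, and assigning a string to tower.height raises TypeError right at the assignment.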

I have a solution for subclassing that’s not beautiful, but workable.  Next, I need to work on persistent references and _p_changed notification.  Once I have those in place, I intend to release a pre-alpha and change ZODB to remove the pickle wrapper around database objects that are protobuf messages.  At that point, when combined with RelStorage, ZODB will finally store data in a language-independent format.

As a matter of principle, important data should never be tied to any particular programming language.  Still, the fact that so many people use ZODB even without language independence is a testament to how good ZODB is.

Jonathan Ellis pointed out Thrift, an alternative to protobuf.  Thrift looks like it could be good, but my customer is already writing protobuf code, so I’m going to stick with that for now.  I suspect almost everything I’m doing will be applicable to Thrift anyway.

I know what some of you must be thinking: what about storing everything in JSON like CouchDB?  I’m sure it could be done, but I don’t yet see a benefit.  Maybe someone will clue me in.

Contemplating Integration of Protocol Buffers into ZODB

I am now looking at integrating Google’s Protocol Buffers into ZODB.  This means that wherever possible, ZODB should store a serialized protocol buffer rather than a pickle.  The main thing my customer hopes to gain is language independence, but storing protocol buffers could open other possibilities as well.  I think this is a very good idea and I’m going to pursue it.

Although ZODB has always clung tightly to Python pickles, I don’t think that moving to a different serialization format violates any basic ZODB design principle.  On the other hand, I have tried to change the serialization format before; in particular, the APE project tried to make the serialization format completely pluggable.  The APE project turned out not to be viable.  Therefore, this project must not repeat the mistake that broke APE.

Why did APE fail?  APE was not viable because it was too hard to debug.  It was too hard to debug because storage errors never occurred near their source, so tracebacks were never very helpful.  The only people who could solve such problems were those who were intimately familiar with the application, APE, and ZODB, all at the same time.  The storage errors indicated that some application code was trying to store something the serialization layer did not like.  The serialization layer had no ability to voice its objections until transaction commit, at which time the incompatible application code was no longer on the stack (and might even be on a different machine if ZEO was involved).

This turned out to be a much more serious problem than I anticipated.  It made me realize that one of the top design concerns for large applications is producing high-quality stack tracebacks in error messages.  A quality traceback can pinpoint the cause of an error in seconds.  I am not aware of any substitute for quality tracebacks, so I am now willing to sacrifice a lot to get them.

So I am faced with a basic design choice.  Should protocol buffers be integrated into ZODB at the application layer, rather than behind a storage layer as APE did?  If I choose to take this route, Persistent subclasses will need to explicitly store a protocol buffer.  Storage errors will then occur mostly at their source, since the protocol buffer classes check validity immediately.
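The difference between the two approaches can be sketched in a few lines of Python.  DeferredStore and EagerRecord are hypothetical classes invented for this illustration, not APE, ZODB, or protobuf code.  The sketch only shows which functions appear in the traceback when validation happens at commit time versus at assignment time.

```python
import traceback


class DeferredStore:
    """Hypothetical storage layer that validates only at commit time."""

    def __init__(self):
        self.pending = {}

    def set(self, name, value):
        # Accepts anything; problems surface later.
        self.pending[name] = value

    def commit(self):
        for name, value in self.pending.items():
            if not isinstance(value, float):
                raise TypeError("cannot store %r in %r" % (value, name))


class EagerRecord:
    """Hypothetical record that validates at assignment time."""

    def set_height(self, value):
        if not isinstance(value, float):
            raise TypeError("cannot store %r in 'height'" % (value,))
        self.height = value


def buggy_application_code(set_height):
    # The actual bug: a string where a float is expected.
    set_height("very tall")


def frames_in_traceback(exc):
    """Return the function names recorded in an exception's traceback."""
    return [frame.name for frame in traceback.extract_tb(exc.__traceback__)]


# Deferred validation: the buggy call succeeds, the commit fails later.
store = DeferredStore()
buggy_application_code(lambda value: store.set("height", value))
try:
    store.commit()
except TypeError as exc:
    deferred_frames = frames_in_traceback(exc)

# Eager validation: the error is raised while the buggy code is running.
record = EagerRecord()
try:
    buggy_application_code(record.set_height)
except TypeError as exc:
    eager_frames = frames_in_traceback(exc)
```

With eager validation, buggy_application_code shows up in the traceback.  With deferred validation it does not, because that function has already returned by the time commit() raises.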

Now that I’ve written this out, the right choice seems obvious: the main integration should indeed be done at the application layer.  Until now, it was hard to distinguish this issue from other issues like persistent references and the format of database records.

Furthermore, the simplest thing to do at first is to store a protocol buffer object as an attribute of a normal persistent object, rather than my initial idea of creating classes that join Persistent with Google’s generated classes.  That means we will still store a pickle, but the pickle will contain a serialized protocol buffer.  Later on, I will figure out how to store a protocol buffer without a pickle surrounding it.  I will also provide a method of storing persistent references, though it might be different from the method ZODB users are accustomed to.
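A rough sketch of that first step, under loud assumptions: FakeTowerMessage is a hypothetical stand-in for a generated protobuf class (JSON replaces the real wire format), and TowerRecord plays the role of a Persistent subclass without actually inheriting from Persistent.  Only the method names SerializeToString and ParseFromString mirror the real protobuf message API.  The sketch pickles the object state directly rather than going through ZODB.

```python
import json
import pickle


class FakeTowerMessage:
    """Hypothetical stand-in for a generated protobuf message class."""

    def __init__(self, name="", height=0.0):
        self.name = name
        self.height = height

    def SerializeToString(self):
        # JSON is an assumption; real messages use the protobuf format.
        fields = {"name": self.name, "height": self.height}
        return json.dumps(fields).encode("utf-8")

    def ParseFromString(self, data):
        fields = json.loads(data.decode("utf-8"))
        self.name = fields["name"]
        self.height = fields["height"]


class TowerRecord:
    """Plays the role of a persistent object holding a message attribute."""

    def __init__(self):
        self.message = FakeTowerMessage()

    def __getstate__(self):
        # The state is still a Python structure, but the payload inside
        # it is the serialized message.
        return {"message": self.message.SerializeToString()}

    def __setstate__(self, state):
        self.message = FakeTowerMessage()
        self.message.ParseFromString(state["message"])


record = TowerRecord()
record.message.name = "Rapunzel's"
record.message.height = 1000.0

# Pickle the state directly; this is a pickle wrapping message bytes.
pickled_state = pickle.dumps(record.__getstate__())

# Restore into a fresh object without calling __init__.
restored = TowerRecord.__new__(TowerRecord)
restored.__setstate__(pickle.loads(pickled_state))
```

The pickle here is only a thin envelope: a reader in another language would still have to unwrap it, which is why removing the surrounding pickle is the later step.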