Contemplating Integration of Protocol Buffers into ZODB

I am now looking at integrating Google’s Protocol Buffers into ZODB. This means that wherever possible, ZODB should store a serialized protocol buffer rather than a pickle. The main thing my customer hopes to gain is language independence, but storing protocol buffers could open other possibilities as well. I think this is a very good idea and I’m going to pursue it.

Although ZODB has always clung tightly to Python pickles, I don’t think that moving to a different serialization format violates any basic ZODB design principle. On the other hand, I have tried to change the serialization format before; in particular, the APE project tried to make the serialization format completely pluggable. The APE project turned out not to be viable. Therefore, this project must not repeat the mistake that broke APE.

Why did APE fail? APE was not viable because it was too hard to debug. It was too hard to debug because storage errors never occurred near their source, so tracebacks were never very helpful. The only people who could solve such problems were those who were intimately familiar with the application, APE, and ZODB, all at the same time. The storage errors indicated that some application code was trying to store something the serialization layer did not like. The serialization layer had no ability to voice its objections until transaction commit, at which time the incompatible application code was no longer on the stack (and might even be on a different machine if ZEO was involved).
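To make the failure mode concrete, here is a minimal sketch. It uses a toy store of my own invention (`DeferredStore` is hypothetical, not APE or ZODB code) that accepts anything and serializes only at commit. The lambda stands in for "something the serialization layer did not like."

```python
import pickle
import traceback


class DeferredStore:
    """Toy stand-in for a storage layer that serializes only at commit."""

    def __init__(self):
        self.pending = {}

    def set(self, key, value):
        # No validation here: any value is accepted without complaint.
        self.pending[key] = value

    def commit(self):
        # Serialization problems surface only now, far from the set() call.
        return {key: pickle.dumps(value) for key, value in self.pending.items()}


def application_code(store):
    # The actual mistake: a lambda cannot be pickled.
    store.set("callback", lambda: None)


store = DeferredStore()
application_code(store)  # returns normally; the bug goes unnoticed

commit_frames = []
try:
    store.commit()
except Exception as exc:
    # Collect the function names that appear in the traceback.
    commit_frames = [frame.name for frame in traceback.extract_tb(exc.__traceback__)]

print(commit_frames)
```

The traceback names `commit` but never `application_code`, so it says nothing about which code stored the offending value.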

This turned out to be a much more serious problem than I anticipated. It made me realize that one of the top design concerns for large applications is producing high-quality stack tracebacks in error messages. A quality traceback can pinpoint the cause of an error in seconds. I am not aware of any substitute for quality tracebacks, so I am now willing to sacrifice a lot to get them.

So I am faced with a basic design choice. Should protocol buffers be integrated into ZODB at the application layer, rather than behind a storage layer as APE was? If I take this route, Persistent subclasses will need to explicitly store a protocol buffer. Storage errors will then occur mostly at their source, since the protocol buffer classes check validity immediately.
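Here is a rough sketch of what "check validity immediately" buys. `PersonMessage` is a hypothetical stand-in for a generated protocol buffer class, written by hand so the example runs without the protobuf library; its field names and checking logic are assumptions made for illustration.

```python
import traceback


class PersonMessage:
    """Hypothetical stand-in for a generated protocol buffer class."""

    _field_types = {"name": str, "age": int}

    def __setattr__(self, field, value):
        expected = self._field_types.get(field)
        if expected is None:
            raise AttributeError(f"no such field: {field}")
        if not isinstance(value, expected):
            # Reject the bad value at assignment time, not at commit time.
            raise TypeError(
                f"{field} must be {expected.__name__}, got {type(value).__name__}"
            )
        super().__setattr__(field, value)


def application_code(message):
    message.name = "Alice"
    message.age = "forty"  # wrong type: fails right here


assignment_frames = []
try:
    application_code(PersonMessage())
except TypeError as exc:
    # Collect the function names that appear in the traceback.
    assignment_frames = [
        frame.name for frame in traceback.extract_tb(exc.__traceback__)
    ]

print(assignment_frames)
```

This time `application_code` is in the traceback, so the error points straight at the line that caused it.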

Now that I’ve written this out, the right choice seems obvious: the main integration should indeed be done at the application layer. Until now, it was hard to distinguish this issue from other issues like persistent references and the format of database records.

Furthermore, the simplest thing to do at first is to store a protocol buffer object as an attribute of a normal persistent object, rather than my initial idea of creating classes that join Persistent with Google’s generated classes. That means we will still store a pickle, but the pickle will contain a serialized protocol buffer. Later on, I will figure out how to store a protocol buffer without a pickle surrounding it. I will also provide a method of storing persistent references, though it might be different from the method ZODB users are accustomed to.
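A minimal sketch of the "pickle containing a serialized protocol buffer" idea follows. Everything in it is an assumption made for illustration: `AddressMessage` is a hand-written stand-in that borrows the `SerializeToString()` and `ParseFromString()` method names from generated classes but uses a toy encoding, and `Contact` is a plain class standing in for a Persistent subclass so the example runs without ZODB. The sketch pickles the object's state directly, which approximates how a database record would carry it.

```python
import pickle


class AddressMessage:
    """Hypothetical stand-in for a generated protocol buffer class.

    The encoding below is a toy, not the protocol buffer wire format.
    """

    def __init__(self, street="", city=""):
        self.street = street
        self.city = city

    def SerializeToString(self):
        return "\n".join([self.street, self.city]).encode("utf-8")

    def ParseFromString(self, data):
        self.street, self.city = data.decode("utf-8").split("\n")


class Contact:
    """Stand-in for a Persistent subclass that holds a message attribute."""

    def __init__(self, address):
        self.address = address

    def __getstate__(self):
        # Store the serialized message bytes instead of the message object.
        return {"address": self.address.SerializeToString()}

    def __setstate__(self, state):
        self.address = AddressMessage()
        self.address.ParseFromString(state["address"])


contact = Contact(AddressMessage("1 Main St", "Springfield"))

# The record is still a pickle, but its payload is the serialized message.
record = pickle.dumps(contact.__getstate__())

# Rebuild an object from the record without calling __init__.
restored = Contact.__new__(Contact)
restored.__setstate__(pickle.loads(record))

print(restored.address.street, restored.address.city)
```

Any reader of the record still has to unwrap the pickle first, which is why storing a bare protocol buffer remains the later goal.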

2 thoughts on “Contemplating Integration of Protocol Buffers into ZODB”

Jonathan Ellis says:

January 7, 2009 at 4:28 pm

You could also look at Thrift, which is basically “Facebook’s protocol buffers, with way more supported languages and richer type support.” I recently submitted patches to clean up the generated Python code; one has been integrated and the other should be soon. There’s #thrift on freenode as well as a mailing list. I’m a fan.

http://wiki.apache.org/thrift/

Shane Hathaway says:

January 7, 2009 at 8:06 pm

Thanks for the pointer!

I wonder, though, why both Google and Facebook are creating Python code generators! My guess is that people think generated code is faster, which might very well be true in static languages, but in Python there is no advantage at all, since a static class definition in Python produces the very same thing that a dynamic definition would produce. Rather than generate code, I would strongly prefer to use an interface parser directly from Python. I could set up some sort of caching mechanism if the parser is slow, but I bet most of the interface parsers are actually quite fast.
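A small sketch of the point about static and dynamic definitions, using made-up class names: a `class` statement and a call to `type()` both yield ordinary classes of the same kind.

```python
class StaticPoint:
    """Defined with a class statement, as generated code would be."""

    x = 0
    y = 0

    def norm1(self):
        return abs(self.x) + abs(self.y)


def norm1(self):
    return abs(self.x) + abs(self.y)


# The same class built at runtime, as an interface parser might build it.
DynamicPoint = type("DynamicPoint", (object,), {"x": 0, "y": 0, "norm1": norm1})

static_point = StaticPoint()
dynamic_point = DynamicPoint()
static_point.x, static_point.y = 3, -4
dynamic_point.x, dynamic_point.y = 3, -4

print(type(StaticPoint) is type(DynamicPoint))
print(static_point.norm1() == dynamic_point.norm1())
```

Both classes are instances of `type`, and their instances behave the same way.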

