keas.pbpersist and keas.pbstate

Remember how I was talking about serializing data in ZODB using Google Protocol Buffers instead of pickles?  Well, Keas Inc. suggested the idea and asked me to work on it.  The first release of a package combination that implements the idea is ready:

  • keas.pbstate 0.1.1 (on PyPI), which helps you write classes that store all state in a Protocol Buffer message.
  • A patch for ZODB (available at packages.willowrise.org) that makes it possible to plug in serializers other than ZODB’s standard pickle format.
  • keas.pbpersist 0.1 (on PyPI), which registers a special serializer with ZODB so that ProtobufState objects get stored without any pickling.

This code is new, but the tests all pass and the coverage report claims that every line of code is exercised by the tests at least once.  I hope these packages get used so we can find out their flaws.  I did my best to make keas.pbstate friendly, but there are some areas, object references in particular, where I don’t like the current syntax.  I don’t know if this code is fast; optimization would be premature!
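To give a feel for keas.pbstate, a class that keeps its state in a protobuf message looks roughly like this; the import path and the generated TowerMessage module are illustrative guesses, so check the package documentation for the real spelling.  (The “ZODB + Protobuf Looking Good” entry below shows a fuller example.)

from keas.pbstate.meta import ProtobufState   # assumed import path
from myapp.tower_pb2 import TowerMessage      # hypothetical protoc-generated class

class Tower(object):
    __metaclass__ = ProtobufState
    protobuf_type = TowerMessage

    def __init__(self):
        # Stored in the TowerMessage, not in the instance __dict__.
        self.name = "Rapunzel's"
        self.height = 1000.0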

I should mention that while working with protobuf, I got the feeling that C++ is the native language and Java and Python are second class citizens.  I wonder if I’d have the same feeling with other serialization formats.

RelStorage 1.1.2

This release has two new useful features, one for performance, one for safety.

The performance feature is that if you use both the cache-servers and poll-interval options, RelStorage will use the cache to distribute basic change notifications.  That means we get to lighten the load on the database using the poll-interval, yet changes should still be seen instantly on all clients.  Yay! 🙂
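In zope.conf (or any ZConfig file that loads RelStorage) the combination looks roughly like this; the poll interval, cache address, and dsn are made-up examples:

%import relstorage

<zodb_db main>
  mount-point /
  <relstorage>
    poll-interval 60
    cache-servers 127.0.0.1:11211
    <postgresql>
      dsn dbname='zodb'
    </postgresql>
  </relstorage>
</zodb_db>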

The only drawback I expect is that caching makes debugging more difficult.  Still, this option should help people build enormous clusters, like the one my current customer was planning to build, although I got word today that they have changed their mind.

The new safety feature is the pack-dry-run option, which lets you run only the nondestructive pre_pack phase to get a list of everything that would be deleted by the pack phase.  This should be particularly useful if you’re trying out packing for the first time on a big database.  My current customer would have benefited from this too.
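Configuration-wise it is just one more option in the <relstorage> section sketched above:

    pack-dry-run true

With that set, packing runs only the pre_pack phase and leaves behind the list of objects it would have removed, so you can review the damage before doing it for real.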

I also fixed a bug that prevented the pack code from removing as much old data as it should, and I started using PyPI instead of the wiki as the main web page.  Using PyPI means I have to maintain only one README, which gets translated automatically into the PyPI page.  Until now I’ve had to maintain both the README and the wiki page.

http://pypi.python.org/pypi/RelStorage/1.1.2

Patched ZODB3 Eggs Available

This week, I put up some ZODB3 eggs and source distributions with the patch required for RelStorage already applied. I built both ZODB3-3.8.1-polling and ZODB3-3.7.3-polling.  I even made eggs for Windows developers who have not yet taken the time to set up MinGW. 😉

http://packages.willowrise.org/

Developers can use this web site in buildout.cfg to incorporate RelStorage in their applications.  Feel free to mirror the site if you need to.
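A buildout.cfg that picks up one of the patched eggs looks something like this (pin whichever of the two versions you need):

[buildout]
find-links = http://packages.willowrise.org
versions = versions

[versions]
ZODB3 = 3.8.1-polling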

Promoting the Zope Component Architecture

I just sent this bit of promotion for the Zope Component Architecture to a friend (paraphrased slightly):

The Zope community uses adaptation for a new purpose. While adaptation is classically a way to force uncooperative classes to communicate, that is not the intent of the component architecture.

The intent of adapters in Zope is to expand objects’ contracts to fulfill application requirements while not polluting reusable code.  Adapters act like a streamlined form of model/view separation.
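A tiny sketch of the idea, using invented IEmployee and IMailRecipient interfaces; the reusable employee code never has to learn that some application wants to mail reports:

from zope.interface import Interface, implements
from zope.component import adapts, getGlobalSiteManager

class IEmployee(Interface):
    """Contract of a reusable domain object."""

class IMailRecipient(Interface):
    def mail_address():
        """Return the address reports should be mailed to."""

class EmployeeMailRecipient(object):
    """Application-specific view of an employee, kept out of the reusable code."""
    implements(IMailRecipient)
    adapts(IEmployee)

    def __init__(self, employee):
        self.employee = employee

    def mail_address(self):
        return self.employee.email

getGlobalSiteManager().registerAdapter(EmployeeMailRecipient)

# Application code adapts instead of modifying the employee class:
#   recipient = IMailRecipient(employee)
#   send_report(recipient.mail_address())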

The community’s experience with Zope 2 was that while we did our best to keep code modular and reusable, there were simply too many cases where we needed to change the behavior of a base component in order to fulfill application requirements. We badly needed a way to use some kind of model/view separation at any point in the code.

Using Zope 2, we tried several industry solutions to address this, but with every one of them we still found that application-specific dependencies had to creep into otherwise reusable code in order to meet reasonable deadlines. That really hurt, too, because we were a bunch of hard-core OO developers and we really hated breaking modularity. We knew we were incurring a long-term debt.

The Zope component architecture is the little gem that resulted from that long experience. It helps programmers avoid creating application-specific dependencies at every level. The component architecture resembles other indirection frameworks like Spring and AOP, but I believe it solves more problems elegantly.

That said, the Zope community now has enough experience with the component architecture to know that the first time we applied it in Zope 3, we applied too much of it in some places. Thus Zope 3 is currently somewhat overgeneralized. Like any indirection framework, you have to gain some experience before you learn what indirections are appropriate.

How am I doing? I think the ZCA would be quite valuable for my friend, who is today an excellent Java designer and coder. I want to find the right words to express why it would be valuable to him.

Easy Workaround for zc.buildout

Problem: running “python bootstrap.py” or “bin/buildout” often produces scripts that mix up the Python package search path due to some packages being installed system-wide.  Version conflicts result.

Workaround: use “python -S bootstrap.py” and “python -S bin/buildout”.  Magically, no more version conflicts.

I wish I had thought of that before.  Duh!

Update: Another tip for new zc.buildout users I’ve been meaning to mention is that you should create a preferences file in your home directory so that downloads and eggs are cached centrally.  This makes zc.buildout much friendlier.  Do this:

mkdir ~/.buildout
echo "[buildout]" >> ~/.buildout/default.cfg
echo "eggs-directory = $HOME/.buildout/eggs" >> ~/.buildout/default.cfg
echo "download-cache = $HOME/.buildout/cache" >> ~/.buildout/default.cfg

It seems a bit silly that zc.buildout doesn’t have these settings by default.  They make zc.buildout behave a lot like Apache Maven, which is what a lot of Java shops are using these days.  Both zc.buildout and Maven are great tools once you get to know them, but both are a real pain to understand at first.

What Would ZODB + Paxos Look Like?

I just learned about the Paxos algorithm. I think we might be able to use it to create a fully distributed version of ZODB. I found a document that explains Paxos in simple terms.  Now I’m interested in learning about any ideas and software that might support integration of Paxos into ZODB.  I would also like to know how much interest people have in such a project.

I think ZODB’s transaction layer already implements a sort of squashed version of Paxos, but it’s not currently possible to separate the pieces to make it distributed.  To me, “distributed ZODB” means multiple servers accept writes while assuring consistency at all times.  I also require sub-millisecond response times on the majority of read operations, since that is what ZODB applications have come to rely upon.  I suspect the speed requirement disqualifies systems like CouchDB.

Egg Patching Solution #3

Martijn Faassen suggested this solution in a comment on my previous post, and I think it’s the best.  I created a new service:

http://packages.willowrise.org

I simply posted a patched ZODB3 source distribution on a virtual-hosted server.  The first tarball, “ZODB3-3.8.1-polling-serial.tar.gz”, includes both the invalidation polling patch and the framework I created for plugging in data serialization formats other than pickles, but in the near future I plan to also post distributions with just the polling patch and some eggs for Windows users.

It would not make sense for me to post the patched tarballs and eggs on PyPI because I don’t want people to pull these patched versions accidentally.  Pulling these needs to be an explicit step.

Thanks to setuptools and zc.buildout, it turns out that creating a Python code distribution server is a piece of cake.  The buildout process scans the HTML pages on distribution servers for <a> links.  Any link that points to a tarball or egg with a version number is considered a candidate.  A static web site can easily fulfill these requirements. I imagine it gets deeper than that, but for now, that’s all I need.
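An index page as trivial as this one is enough for buildout to find the tarball (a sketch, not the actual page I serve):

<html>
  <body>
    <a href="ZODB3-3.8.1-polling-serial.tar.gz">ZODB3-3.8.1-polling-serial.tar.gz</a>
  </body>
</html>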

To use this tarball, buildout.cfg just needs to include lines something like:

[buildout]
find-links = http://packages.willowrise.org
versions = versions

[versions]
ZODB3 = 3.8.1-polling-serial

zc.buildout does the rest.

It took a while to find this solution because, upon encountering the need to distribute patched eggs, I guessed it would be difficult to set up and maintain my own package distribution server. I also guessed setuptools had no support for patches in its versioning scheme. I’m glad I was completely wrong.

By the way, Ian suggested pip as a solution, but I don’t yet see how it helps. I am interested. I hope to see more of pip on Ian’s great blog.

Egg Patching Solution #2

I’ve been thinking more about patching Python eggs.  All I really need is for buildout.cfg to use a patched egg.  It doesn’t matter when the patching happens (although monkey patching is unacceptable; the changes I’m making are too complex for that).  So the buildout process should download an egg that has already been patched.  That solution is probably less error-prone anyway.

So, I could create a “ZODB3-polling” egg that contains ZODB 3.8.1 with the invalidation polling patch, then upload that to PyPI.  All I have to do is tell people how to change their buildout.cfg to use my egg in place of the ZODB3 egg.

Ah, but there’s trouble: the ZODB3 egg is pulled in automatically through egg dependencies.  If people simply add my new egg to their buildout.cfg, they will end up with two ZODB versions in the Python path at once.  Which one wins?!

Therefore, it seems like zc.buildout should have a way to express, in buildout.cfg, “any requirement for egg X should instead be satisfied by egg Y”.  I am going to study how that might be done.

Poaching (Patching) Eggs

The term “egg” as used in the Python community seems so whimsical.  It deserves lots of puns.  A couple of weeks ago, I made a little utility for myself that takes all the eggs from an egg farm produced by zc.buildout and makes a single directory tree full of Python packages and modules.  I called it Omelette.  Get it?  Ha!  (I can hear chickens groaning already…)  The surprising thing about Omelette is it typically finishes in less than 1 second, even with dozens of eggs and thousands of modules.  It mostly produces symlinks, but it also unpacks zip files.  I plan to share it, but I don’t know when I’ll get around to packaging it.

Anyway, I want to talk about poaching (patching) eggs.  As systems grow in complexity, patching becomes more important.  Linux distributors, for example, solve a really complex problem, and their solution is to patch nearly every package.  If they didn’t, installed systems would be an unstable and unglued mess.  I imagine distributors’ patches usually reach upstream maintainers, but I also imagine it often takes months or years for those patches to trickle into a stable release of each package.

I really want to find a good way to integrate patching into the Python egg installation process.  I want to be able to say, in package metadata, that my package requires ZODB with a certain patch.  That patch would take the form of a diff file that might be downloaded from the Web.  I also want to be able to say that another package requires ZODB with a different patch, and assuming those patches have no conflicts, I want the Python package installation system to install ZODB with both patches.  Moreover, I want other buildouts to use ZODB without patches, even though I have a centralized cache of eggs in my ~/.buildout directory.

So let’s say my Python package installation system is zc.buildout, setuptools, and distutils.  Which layer should be modified to support patching?  I don’t think the need for automated patching arises until you’re combining a lot of packages, so it would seem most natural to put patching in zc.buildout.  I can imagine a buildout.cfg like this:

[versions]
ZODB3=3.8.1 +poll-invalidations

[patches]
poll-invalidations=http://example.com/path/poll-invalidations-3.8.1.diff

I wonder how difficult it would be to achieve that.  Some modification of setuptools might be required.  Alternatively, can Paver patch eggs?  I suspect Paver is not very good at patching eggs either.

ZODB + Protobuf Looking Good

Today I drafted a module that mixes ZODB with Protocol Buffers.  It’s looking good!  I was hoping not to use metaclasses, but metaclasses solved a lot of problems, so I’ll keep them unless they cause deeper problems.

The code so far lets you create a class like this:

from persistent import Persistent
# ProtobufState is the metaclass from the module drafted in this post;
# Tower_pb is the class protoc generated from the message schema shown below.

class Tower(Persistent):
    __metaclass__ = ProtobufState
    protobuf_type = Tower_pb

    def __init__(self):
        self.name = "Rapunzel's"
        self.height = 1000.0
        self.width = 10.0

    def __str__(self):
        return '%s %fx%f %d' % (
            self.name, self.height, self.width, self.build_year)

The only special parts are the __metaclass__ and protobuf_type lines.  The __metaclass__ line says the class should use a metaclass that causes all persistent state to be stored in a protobuf message.  The protobuf_type line says which protobuf class to use.  The protobuf class comes from a generated Python module (not shown).

If you try to store a string in the height attribute, Google’s code raises an error immediately, which is probably a nice new benefit.  Also, this Python code does not have to duplicate the content of the .proto file.  Did you notice the build_year attribute?  Nothing sets it, but since it’s defined in the message schema, it’s possible to read the default value at any time.  Here is the message schema, by the way:

message Tower {
    required string name = 1;
    required float width = 2;
    required float height = 3;
    optional int32 build_year = 4 [default = 1900];
}
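A quick illustration of both points, assuming Tower_pb was generated from this schema:

t = Tower()
print t.build_year       # -> 1900, straight from the schema default
t.height = 'very tall'   # the generated protobuf code raises TypeError here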

The metaclass creates a property for every field defined in the protobuf message schema.  Each property delegates the storage of an attribute to a message field.  The metaclass also adds __getstate__, __setstate__, and __new__ methods to the class; these are used by the pickle module.  Since ZODB uses the pickle module, the serialization of these objects in the database is now primarily a protobuf message.  Yay!  I accomplished all of this with only about 100 lines of code, including comments. 🙂
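For the curious, here is a compressed sketch of the approach.  It is not the real module, just the shape of it, and it ignores subclassing, persistent references, and _p_changed, which the real code still has to handle:

class ProtobufState(type):
    """Sketch of a metaclass that keeps all instance state in a protobuf message."""

    def __new__(meta, name, bases, namespace):
        cls = super(ProtobufState, meta).__new__(meta, name, bases, namespace)
        pb_type = namespace.get('protobuf_type')
        if pb_type is None:
            return cls

        # One property per field declared in the .proto schema.
        for field in pb_type.DESCRIPTOR.fields:
            setattr(cls, field.name, meta._make_property(field.name))

        # Create the message before __init__ runs.
        def new_instance(inst_cls, *args, **kw):
            self = super(cls, inst_cls).__new__(inst_cls)
            self._pb = pb_type()
            return self
        cls.__new__ = staticmethod(new_instance)

        # pickle (and therefore ZODB) sees only the serialized message.
        def getstate(self):
            return self._pb.SerializeToString()
        def setstate(self, data):
            self.__dict__['_pb'] = pb_type()
            self._pb.MergeFromString(data)
        cls.__getstate__ = getstate
        cls.__setstate__ = setstate
        return cls

    @staticmethod
    def _make_property(name):
        def get(self):
            return getattr(self._pb, name)
        def set(self, value):
            setattr(self._pb, name, value)
        return property(get, set)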

I have a solution for subclassing that’s not beautiful, but workable.  Next, I need to work on persistent references and _p_changed notification.  Once I have those in place, I intend to release a pre-alpha and change ZODB to remove the pickle wrapper around database objects that are protobuf messages.  At that point, when combined with RelStorage, ZODB will finally store data in a language independent format.

As a matter of principle, important data should never be tied to any particular programming language.  Still, the fact that so many people use ZODB even without language independence is a testament to how good ZODB is.

Jonathan Ellis pointed out Thrift, an alternative to protobuf.  Thrift looks like it could be good, but my customer is already writing protobuf code, so I’m going to stick with that for now.  I suspect almost everything I’m doing will be applicable to Thrift anyway.

I know what some of you must be thinking: what about storing everything in JSON like CouchDB?  I’m sure it could be done, but I don’t yet see a benefit.  Maybe someone will clue me in.