Promoting the Zope Component Architecture

I just sent this bit of promotion for the Zope Component Architecture to a friend (paraphrased slightly):

The Zope community uses adaptation for a new purpose. While adaptation is classically a way to force uncooperative classes to communicate, that is not the intent of the component architecture.

The intent of adapters in Zope is to expand objects’ contracts to fulfill application requirements while not polluting reusable code.  Adapters act like a streamlined form of model/view separation.
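To make this concrete, here is a minimal sketch using zope.interface and zope.component. The Employee and IBadgeLabel names are invented for illustration, and I'm using the decorator spelling of interface declarations:

from zope.component import getGlobalSiteManager
from zope.interface import Interface, implementer

class IEmployee(Interface):
    """Contract of a reusable domain object."""

class IBadgeLabel(Interface):
    """Application-specific requirement: text for an ID badge."""
    def label():
        """Return the badge text."""

@implementer(IEmployee)
class Employee(object):
    def __init__(self, name):
        self.name = name

@implementer(IBadgeLabel)
class EmployeeBadgeLabel(object):
    """Adapter: meets an application requirement without touching Employee."""
    def __init__(self, context):
        self.context = context

    def label(self):
        return 'Employee: %s' % self.context.name

getGlobalSiteManager().registerAdapter(
    EmployeeBadgeLabel, (IEmployee,), IBadgeLabel)

# The application asks for the contract it needs; the reusable
# Employee class never learns that badges exist.
print(IBadgeLabel(Employee('Alice')).label())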

The community’s experience with Zope 2 was that while we did our best to keep code modular and reusable, there were simply too many cases where we needed to change the behavior of a base component in order to fulfill application requirements. We badly needed a way to use some kind of model/view separation at any point in the code.

In Zope 2, we tried several industry solutions to address this, but with each of them we still found that application-specific dependencies crept into otherwise reusable code whenever we had to meet reasonable deadlines. That really hurt, because we were a bunch of hard-core OO developers who hated breaking modularity. We knew we were incurring a long-term debt.

The Zope component architecture is the little gem that resulted from that long experience. It helps programmers avoid creating application-specific dependencies at every level. The component architecture resembles other indirection frameworks like Spring and AOP, but I believe it solves more problems elegantly.

That said, the Zope community now has enough experience with the component architecture to know that the first time we applied it in Zope 3, we applied too much of it in some places. Thus Zope 3 is currently somewhat overgeneralized. Like any indirection framework, you have to gain some experience before you learn which indirections are appropriate.

How am I doing? I think the ZCA would be quite valuable for my friend, who is today an excellent Java designer and coder. I want to find the right words to express why it would be valuable to him.

Easy Workaround for zc.buildout

Problem: running “python bootstrap.py” or “bin/buildout” often produces scripts whose Python package search path is scrambled because some packages are installed system-wide.  The result is version conflicts.

Workaround: use “python -S bootstrap.py” and “python -S bin/buildout”.  Magically, no more version conflicts.

I wish I had thought of that before.  Duh!

Update: Another tip for new zc.buildout users I’ve been meaning to mention is that you should create a preferences file in your home directory so that downloads and eggs are cached centrally.  This makes zc.buildout much friendlier.  Do this:

mkdir ~/.buildout
echo "[buildout]" >> ~/.buildout/default.cfg
echo "eggs-directory = $HOME/.buildout/eggs" >> ~/.buildout/default.cfg
echo "download-cache = $HOME/.buildout/cache" >> ~/.buildout/default.cfg

It seems a bit silly that zc.buildout doesn’t have these settings by default.  They make zc.buildout behave a lot like Apache Maven, which is what a lot of Java shops are using these days.  Both zc.buildout and Maven are great tools once you get to know them, but both are a real pain to understand at first.

What Would ZODB + Paxos Look Like?

I just learned about the Paxos algorithm. I think we might be able to use it to create a fully distributed version of ZODB. I found a document that explains Paxos in simple terms.  Now I’m interested in learning about any ideas and software that might support integration of Paxos into ZODB.  I would also like to know how much interest people have in such a project.

I think ZODB’s transaction layer already implements a sort of squashed version of Paxos, but it’s not currently possible to separate the pieces to make it distributed.  To me, “distributed ZODB” means multiple servers accept writes while ensuring consistency at all times.  I also require sub-millisecond response times for the majority of read operations, since that is what ZODB applications have come to rely on.  I suspect the speed requirement disqualifies systems like CouchDB.

Egg Patching Solution #3

Martijn Faassen suggested this solution in a comment on my previous post, and I think it’s the best one.  I created a new service:

http://packages.willowrise.org

I simply posted a patched ZODB3 source distribution on a virtual-hosted server.  The first tarball, “ZODB3-3.8.1-polling-serial.tar.gz”, includes both the invalidation polling patch and the framework I created for plugging in data serialization formats other than pickles.  In the near future I plan to also post distributions with just the polling patch, as well as some eggs for Windows users.

It would not make sense for me to post the patched tarballs and eggs on PyPI because I don’t want people to pull these patched versions accidentally.  Pulling these needs to be an explicit step.

Thanks to setuptools and zc.buildout, it turns out that creating a Python code distribution server is a piece of cake.  The buildout process scans the HTML pages on distribution servers for <a> links.  Any link that points to a tarball or egg with a version number is considered a candidate.  A static web site can easily fulfill these requirements. I imagine it gets deeper than that, but for now, that’s all I need.
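For example, a hand-written index page like this would be enough for buildout to find the tarball (this is just an illustration, not the actual page I posted):

<html>
  <body>
    <a href="ZODB3-3.8.1-polling-serial.tar.gz">ZODB3-3.8.1-polling-serial.tar.gz</a>
  </body>
</html>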

To use this tarball, buildout.cfg just needs to include lines something like:

[buildout]
find-links = http://packages.willowrise.org
versions = versions

[versions]
ZODB3 = 3.8.1-polling-serial

zc.buildout does the rest.

It took a while to find this solution because, upon encountering the need to distribute patched eggs, I guessed it would be difficult to set up and maintain my own package distribution server. I also guessed setuptools had no support for patches in its versioning scheme. I’m glad I was completely wrong.

By the way, Ian suggested pip as a solution, but I don’t yet see how it helps. I am interested. I hope to see more of pip on Ian’s great blog.

Egg Patching Solution #2

I’ve been thinking more about patching Python eggs.  All I really need is for buildout.cfg to use a patched egg.  It doesn’t matter when the patching happens (although monkey patching is unacceptable; the changes I’m making are too complex for that).  So the buildout process should download an egg that has already been patched.  That solution is probably less error-prone anyway.

So, I could create a “ZODB3-polling” egg that contains ZODB 3.8.1 with the invalidation polling patch, then upload that to PyPI.  All I have to do is tell people how to change their buildout.cfg to use my egg in place of the ZODB3 egg.

Ah, but there’s trouble: the ZODB3 egg is pulled in automatically through egg dependencies.  If people simply add my new egg to their buildout.cfg, they will end up with two ZODB versions in the Python path at once.  Which one wins?!

Therefore, it seems like zc.buildout should have a way to express, in buildout.cfg, “any requirement for egg X should instead be satisfied by egg Y”.  I am going to study how that might be done.
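To illustrate what I mean, the buildout.cfg syntax might look something like this.  The substitutes option is purely hypothetical; zc.buildout has no such feature today:

[buildout]
# Hypothetical option: any requirement for ZODB3 would be
# satisfied by the ZODB3-polling egg instead.
substitutes = ZODB3:ZODB3-polling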

Poaching (Patching) Eggs

The term “egg” as used in the Python community seems so whimsical.  It deserves lots of puns.  A couple of weeks ago, I made a little utility for myself that takes all the eggs from an egg farm produced by zc.buildout and makes a single directory tree full of Python packages and modules.  I called it Omelette.  Get it?  Ha!  (I can hear chickens groaning already…)  The surprising thing about Omelette is that it typically finishes in less than 1 second, even with dozens of eggs and thousands of modules.  It mostly produces symlinks, but it also unpacks zip files.  I plan to share it, but I don’t know when I’ll get around to packaging it.
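For the curious, the core idea fits in a few lines of Python.  This is a simplified sketch, not the actual Omelette code; the real thing also unpacks zipped eggs and merges namespace packages:

import os

def omelette(eggs_dir, out_dir):
    """Symlink the top-level packages of every unzipped egg into one tree."""
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    for egg_name in os.listdir(eggs_dir):
        egg_dir = os.path.join(eggs_dir, egg_name)
        if not os.path.isdir(egg_dir):
            continue  # a zipped egg; the real tool unpacks these
        for name in os.listdir(egg_dir):
            if name == 'EGG-INFO':
                continue  # egg metadata, not importable code
            target = os.path.join(out_dir, name)
            if not os.path.exists(target):
                # Note: namespace packages (zope.*, etc.) need a
                # recursive merge, which this sketch skips.
                os.symlink(os.path.join(egg_dir, name), target)

omelette(os.path.expanduser('~/.buildout/eggs'), 'omelette')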

Anyway, I want to talk about poaching, er, patching eggs.  As systems grow in complexity, patching becomes more important.  Linux distributors, for example, solve a really complex problem, and their solution is to patch nearly every package.  If they didn’t, installed systems would be an unstable and unglued mess.  I imagine distributors’ patches usually reach upstream maintainers, but I also imagine it often takes months or years for those patches to trickle into a stable release of each package.

I really want to find a good way to integrate patching into the Python egg installation process.  I want to be able to say, in package metadata, that my package requires ZODB with a certain patch.  That patch would take the form of a diff file that might be downloaded from the Web.  I also want to be able to say that another package requires ZODB with a different patch, and assuming those patches have no conflicts, I want the Python package installation system to install ZODB with both patches.  Moreover, I want other buildouts to use ZODB without patches, even though I have a centralized cache of eggs in my ~/.buildout directory.

So let’s say my Python package installation system is zc.buildout, setuptools, and distutils.  Which layer should be modified to support patching?  I don’t think the need for automated patching arises until you’re combining a lot of packages, so it would seem most natural to put patching in zc.buildout. I can imagine a buildout.cfg like this:

[versions]
ZODB3 = 3.8.1 +poll-invalidations

[patches]
poll-invalidations = http://example.com/path/poll-invalidations-3.8.1.diff

I wonder how difficult that would be to achieve.  Some modification of setuptools might be required.  Alternatively, could Paver do it?  I suspect Paver has no special support for patching eggs either.

ZODB + Protobuf Looking Good

Today I drafted a module that mixes ZODB with Protocol Buffers.  It’s looking good!  I was hoping not to use metaclasses, but metaclasses solved a lot of problems, so I’ll keep them unless they cause deeper problems.

The code so far lets you create a class like this:

class Tower(Persistent):
    __metaclass__ = ProtobufState
    protobuf_type = Tower_pb

    def __init__(self):
        self.name = "Rapunzel's"
        self.height = 1000.0
        self.width = 10.0

    def __str__(self):
        return '%s %fx%f %d' % (
            self.name, self.height, self.width, self.build_year)

The only special lines are the second and third.  The second line says the class should use a metaclass that causes all persistent state to be stored in a protobuf message.  The third line says which protobuf class to use.  The protobuf class comes from a generated Python module (not shown).

If you try to store a string in the height attribute, Google’s code raises an error immediately, which is probably a nice new benefit.  Also, this Python code does not have to duplicate the content of the .proto file.  Did you notice the build_year attribute?  Nothing sets it, but since it’s defined in the message schema, it’s possible to read the default value at any time.  Here is the message schema, by the way:

message Tower {
    required string name = 1;
    required float width = 2;
    required float height = 3;
    optional int32 build_year = 4 [default = 1900];
}

The metaclass creates a property for every field defined in the protobuf message schema.  Each property delegates the storage of an attribute to a message field.  The metaclass also adds __getstate__, __setstate__, and __new__ methods to the class; these are used by the pickle module.  Since ZODB uses the pickle module, the serialization of these objects in the database is now primarily a protobuf message.  Yay!  I accomplished all of this with only about 100 lines of code, including comments. 🙂
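To show the shape of the thing, here is a rough sketch of how such a metaclass can work.  This is simplified from the real module; among other things, cooperating with Persistent’s own __new__ takes more care than shown here:

def _field_property(field_name):
    # Delegate attribute storage to the identically named message field.
    def get(self):
        return getattr(self._message, field_name)
    def set(self, value):
        setattr(self._message, field_name, value)
    return property(get, set)

class ProtobufState(type):
    def __init__(cls, name, bases, attrs):
        type.__init__(cls, name, bases, attrs)
        pb_type = getattr(cls, 'protobuf_type', None)
        if pb_type is None:
            return

        # One property per field declared in the .proto schema.
        for field in pb_type.DESCRIPTOR.fields:
            setattr(cls, field.name, _field_property(field.name))

        def __new__(subcls, *args, **kw):
            self = object.__new__(subcls)
            self.__dict__['_message'] = pb_type()
            return self

        def __getstate__(self):
            # Pickle (and therefore ZODB) sees only the serialized message.
            return self._message.SerializeToString()

        def __setstate__(self, data):
            self.__dict__['_message'] = pb_type()
            self._message.ParseFromString(data)

        cls.__new__ = staticmethod(__new__)
        cls.__getstate__ = __getstate__
        cls.__setstate__ = __setstate__

With that in place, t = Tower() followed by t.height = 'tall' raises a TypeError from the protobuf runtime immediately, and t.build_year returns 1900 even though no Python code ever set it.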

I have a solution for subclassing that’s not beautiful, but workable.  Next, I need to work on persistent references and _p_changed notification.  Once I have those in place, I intend to release a pre-alpha and change ZODB to remove the pickle wrapper around database objects that are protobuf messages.  At that point, when combined with RelStorage, ZODB will finally store data in a language independent format.

As a matter of principle, important data should never be tied to any particular programming language.  Still, the fact that so many people use ZODB even without language independence is a testament to how good ZODB is.

Jonathan Ellis pointed out Thrift, an alternative to protobuf.  Thrift looks like it could be good, but my customer is already writing protobuf code, so I’m going to stick with that for now.  I suspect almost everything I’m doing will be applicable to Thrift anyway.

I know what some of you must be thinking: what about storing everything in JSON like CouchDB?  I’m sure it could be done, but I don’t yet see a benefit.  Maybe someone will clue me in.

Contemplating Integration of Protocol Buffers into ZODB

I am now looking at integrating Google’s Protocol Buffers into ZODB.  This means that wherever possible, ZODB should store a serialized protocol buffer rather than a pickle.  The main thing my customer hopes to gain is language independence, but storing protocol buffers could open other possibilities as well.  I think this is a very good idea and I’m going to pursue it.

Although ZODB has always clung tightly to Python pickles, I don’t think that moving to a different serialization format violates any basic ZODB design principle.  On the other hand, I have tried to change the serialization format before; in particular, the APE project tried to make the serialization format completely pluggable.  The APE project turned out not to be viable.  Therefore, this project must not repeat the mistake that broke APE.

Why did APE fail?  APE was not viable because it was too hard to debug.  It was too hard to debug because storage errors never occurred near their source, so tracebacks were never very helpful.  The only people who could solve such problems were those who were intimately familiar with the application, APE, and ZODB, all at the same time.  The storage errors indicated that some application code was trying to store something the serialization layer did not like.  The serialization layer had no ability to voice its objections until transaction commit, at which time the incompatible application code was no longer on the stack (and might even be on a different machine if ZEO was involved).

This turned out to be a much more serious problem than I anticipated.  It made me realize that one of the top design concerns for large applications is producing high quality stack tracebacks in error messages.  A quality traceback can pinpoint the cause of an error in seconds.  I am not aware of any substitute for quality tracebacks, so I am now willing to sacrifice a lot to get them.

So I am faced with a basic design choice.  Should protocol buffers be integrated into ZODB at the application layer, rather than behind a storage layer as APE did?  If I choose to take this route, Persistent subclasses will need to explicitly store a protocol buffer.  Storage errors will indeed occur mostly at their source, since the protocol buffer classes will check validity immediately.

Now that I’ve written this out, the right choice seems obvious: the main integration should indeed be done at the application layer.  Until now, it was hard to distinguish this issue from other issues like persistent references and the format of database records.

Furthermore, the simplest thing to do at first is to store a protocol buffer object as an attribute of a normal persistent object, rather than my initial idea of creating classes that join Persistent with Google’s generated classes.  That means we will still store a pickle, but the pickle will contain a serialized protocol buffer.  Later on, I will figure out how to store a protocol buffer without a pickle surrounding it.  I will also provide a method of storing persistent references, though it might be different from the method ZODB users are accustomed to.
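In code, the simple version looks something like this sketch.  Tower_pb stands for the generated protobuf class, and I’m assuming it pickles cleanly (the protobuf runtime serializes messages for pickling):

from persistent import Persistent

class Tower(Persistent):
    def __init__(self):
        # The message rides along inside this object's pickle.
        self.message = Tower_pb()
        self.message.name = "Rapunzel's"

    def set_height(self, height):
        self.message.height = height
        # Mutating the inner message does not mark this object as
        # changed, so we must tell ZODB explicitly.
        self._p_changed = True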

Bootstrap.py versus pkg_resources.py

I’ve been using zc.buildout quite a bit over the past month.  Although it has been working, it has been doing strange things like using the wrong version of zope.interface.  Yesterday I finally figured out why, and today I found a possible solution.

It turns out that Ubuntu (8.10) provides a package called python-pkg-resources.  At least one Ubuntu package (Snowballz, a strategy game written in Python) pulls in that package automatically.  It installs a pkg_resources module in Python’s site-packages directory, but it does not install the rest of setuptools.

I can understand why Ubuntu chose to split up setuptools, but that choice causes havoc for the bootstrap.py module people use to install zc.buildout.  Here is what bootstrap.py is supposed to do:

  1. Download ez_setup.py and run it.
  2. ez_setup tries to import the pkg_resources module, but fails.
  3. The setuptools package is not found, so ez_setup downloads setuptools in a temporary directory.
  4. ez_setup alters sys.path to include the new setuptools package.
  5. bootstrap.py imports the pkg_resources module from the version of setuptools just downloaded.
  6. bootstrap.py asks pkg_resources about the installed setuptools package.
  7. bootstrap.py uses setuptools to install zc.buildout.

Here is what bootstrap.py actually does when pkg_resources.py exists in the site-packages directory (the differences are in steps 2, 5, and 7):

  1. Download ez_setup.py and run it.
  2. ez_setup successfully imports the pkg_resources module from site-packages.
  3. The setuptools package is not found, so ez_setup downloads setuptools in a temporary directory.
  4. ez_setup alters sys.path to include the new setuptools package.
  5. bootstrap.py continues to use the previously imported pkg_resources module.
  6. bootstrap.py asks pkg_resources about the installed setuptools package.
  7. pkg_resources does not find setuptools because it does not notice the change to sys.path.  bootstrap.py fails.

At first, following ideas I gleaned from various posts about zc.buildout, I worked around this by deleting the setuptools egg and the pkg_resources module from site-packages.  I didn’t know exactly why this helped until I studied the problem.  It turns out that bootstrap.py was just not written to cope with a system-wide installation of pkg_resources.

Now I think I recognize another bad choice that zc.buildout has been making.  zc.buildout generates a “bin” directory full of Python scripts.  Those scripts prepend egg directories and egg zip files to sys.path before doing their work.  I noticed that sometimes the list of paths to prepend includes “/usr/lib/python2.5/site-packages”, which is already on sys.path.  I now suspect that whenever zc.buildout includes paths like that, it’s wrong, and the cause is a mixup involving a system-wide installation of pkg_resources, setuptools, or some other foundational package.

Here is a possible way to fix bootstrap.py.  Just before the “import pkg_resources” line, add this:

del sys.modules['pkg_resources']

This solved the bootstrap.py problem for me.  Altering sys.modules is rarely a good idea, but this might be a good exception to the rule.  I don’t believe we need to catch KeyError because ez_setup should have imported pkg_resources already.
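In context, the relevant fragment of bootstrap.py then reads roughly like this:

import sys

# ez_setup already imported the stale pkg_resources from site-packages.
# Forget it so the next import finds the copy that ez_setup just
# downloaded and added to sys.path.
del sys.modules['pkg_resources']

import pkg_resources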

Beyond this, there is probably more work to do to make zc.buildout produce correct scripts.

Whoever said computers behave logically must have been joking or delusional.  The people who provide the software never fully agree with each other, nor even with themselves!

RelStorage 1.1.1 Released

For those just tuning in: RelStorage is an adapter for ZODB (the database behind Zope) that stores data in a relational database such as MySQL, PostgreSQL, or Oracle.  It has advantages over FileStorage and ZEO, the default storage methods.  Read more in the ZODB wiki.
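For example, a Zope instance can point its main database at PostgreSQL with a zope.conf fragment along these lines (the dsn values are placeholders):

%import relstorage
<zodb_db main>
  mount-point /
  <relstorage>
    <postgresql>
      dsn dbname='zodb' user='zodb' host='localhost'
    </postgresql>
  </relstorage>
</zodb_db>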

Download this release from PyPI.  As I mentioned in another post, this release works around MySQL performance bugs to make it possible to pack large databases in a reasonable amount of time.

If you are upgrading from previous versions, a simple schema migration is required.  See the migration instructions.