Thoughts on Beaker

October 01, 2011 at 12:01 PM | Code, Mako/Pylons

Beaker is a widely used caching and HTTP session library published by Ben Bangert. It dates back to the early days of Pylons, when Pylons moved off of the Myghty base to the Python Paste infrastructure. Beaker's guts in fact were originally the internals to the caching and session system of Myghty itself, which in turn was loosely based on the caching code from HTML::Mason (which if those two colons don't look familiar means we've made the jump across the ages to Perl).

The key neato thing nestled deep inside of Beaker is what I later was told is called a "dogpile lock". This is a simple, but slightly less simple than dict.get() type of system whereby the cache can put one thread to work on generating a new value for a certain cache key, while all the other threads and even other processes can all continue to return the expired value, until the new value is ready. I spent weeks getting this aspect to do what it was supposed to. It does things in a slightly more complex but comprehensive way than Mason did; it coordinates workers using a mutex and/or lockfiles, instead of a simple counter that estimates the creation time for a new value.

The storage backends are implemented in terms of a "container", which deals with an individual key, including checking if it's expired, regenerating it, and so forth, and a "namespace", which is the entryway to storing the data. It deals with keys and values like everything else does, and has backends for pickled files, DBM files, relational databases, memcached, and GAE. The general idea here came from Myghty but has been highly modified since then.

Beaker then adds something of a coarse-grained facade over this model, including something of a generic get/put API as well as a system of function decorators that cache the return value of a function. In recent releases it also supports the notion of a "region", which is just a way to package up a particular cache configuration under a single name that can later be referenced by high level caching operations.

Beaker also implements an HTTP session object on top of the same storage backends, taking this from Myghty as well. Implementing HTTP sessions on top of the cache backends was a completely off-the-cuff idea in Myghty, and over the years in Beaker it's had to be wrangled and re-wrangled over basic issues resulting from this mismatch.

As far as the backends themselves, at this point all of them have accumulated a fair degree of woe. The file- and DBM- backends basically perform pretty poorly (DBM maybe not as much). The memcached backend has always been problematic - we slowly learned that the cmemcache library is basically a non starter, and that pylibmc is dramatically faster than the usual memcache library. The memcached backend slowly sprouted an awkward way to select among these backends and more awkwardness to attempt to accommodate pylibmc's special concurrency API. Then when users started combining the memcached backend with HTTP sessions, all kinds of new issues occurred, namely that any particular key might be removed from memcached at any time, that caused us to rework the session implementation even more awkwardly to store the whole session under one "key", defeating the purpose of how the session object works with other backends. As for the SQLAlchemy backends, there are two, and I've no idea why.

The HTTP session as a "bag of random keys", to paraphrase a term I think PJE originally used, is not something I really do these days, as I'm pretty skeptical of the approach of storing a bag of key/value pairs inside of a big serialized BLOB as a solution to anything, really. A few keys in a cookie-based session of very limited size is fine. But a backend-oriented system where you're loading up a giant bag of dozens of keys referencing full object structures inside of a good sized serialized blob inside of a file, database, or memcached-type of system, is both horribly inefficient and a sign of sloppy, ad-hoc coding.

Beaker eventually provided a purely client side encrypted cookie session as an alternative to the server-based sessions. This session implementation stores all the data in an encrypted string (Ben spent a lot of time and got a lot of advice getting the encryption right here) which is stored completely on the client. All of the performance, scalability and failover issues introduced by typical memory or file based HTTP session implementations is solved immediately. So this session implementation, I use lots. I only put a few critical keys in it to track the user, but anything that is more significant is part of the object model which is persisted in the database as a first class object.

The only downside to the cookie-only session is that, unless you coordinate the state of the session with server-side state (which IMHO you should), you can't "force" it to expire sooner than it normally would, or otherwise prevent the replay of previous state, within an attack scenario. While an attacker can't craft a session, (s)he can re-use the same cookie over and over again, and that can only be guarded against by comparing its state to that of a memo on the server; a single "generation" counter can achieve this. I.e. session comes in, generation is "5", the server says it should be "7", reject it. So it may be considered that this turns the encrypted cookie session into a fancy session cookie for a server-side session. But I am actually OK with that. I put very, very little into unstructured sessions.

What I'm getting at here is that I'm not super-interested in most of what Beaker has - a lot of the APIs are awkward, all of the storage backends are either mostly useless or slow and crufty, and I don't care much for the server-side session thing especially using backends that were designed for caching. Overall Beaker creates a coarse opacity on top of a collection of things that I think good application developers should be much more aware of. I really think developers should know how their tools work.

Beaker has some isolated features which I think are great. These are the dogpile lock, the encrypted client-side cookie session, the concept of "cache regions" whereby a set of cache configuration is referencable by a single name, and some nice function decorators - these allow you to apply caching to a function, similarly to:

@cached(region="file_based", namespace="somenamespace")
def return_some_data(x, y, z):
    # ...

I'm not sure what's in store for Beaker. But what I'm doing is:

  • Created Dogpile, which is just the "dogpile lock" portion of Beaker, cleaned up and turned into a very clear, succinct system which you can use to build your own caching implementation. The README includes a full example using Pylibmc.
  • In Mako, for either 0.5.1 or 0.6.0 (not sure yet, though I just put out 0.5.0 the other day for unrelated reasons) the built-in Beaker support is genericized into a plugin system. You can write your own plugins very easily for Mako using Dogpile or similar, if you want caching inside your templates that isn't based off of Beaker. The system supports entrypoints so if a Beaker2 comes out, it would publish a cache backend that Mako could then use.

So we have that. Then, for the community I would like to see:

  • The solution for the key/value backend (or not). Today, high-performance, sophisticated key/value stores are ubiquitous. We of course have memcached. We also have Riak and Redis which appear very feasible for temporary storage, I'm not sure if Tokyo Tyrant is still popular. And of course there's the more permanent-oriented systems like Mongo and Cassandra. I am sure that someone has built some kind of generic facade over these, and there are probably many. These are the key/value systems you should be using, either via a modern facade or one of the libraries directly; Beaker's system, which was originally built to store keys inside of dictionaries or pickled files, is absolutely nothing more than a quaint, historical novelty in comparison.
  • A new library that is just the encrpyted cookie session. I'd use this. It would be very nice for this portion of Beaker to be its own thing. Until then I may just rip out that part of Beaker and stick it in a lib/ folder somewhere in my work projects for internal use.
  • For server-side HTTP sessions, I really think people should be rolling solutions for this as needed, probably using the encrypted cookie session for client side state linked to first class model objects in the datastore (relational or non-relational, doesn't matter). If accessing the datastore is a performance issue, you'd be using caching for those data structures, the same way you'd cache data structures that are not specific to a user (probably based on primary key or on their originating query). In the end this is similar to HTTP sessions stored directly in the cache backend, except there's a well defined model layer in between.
  • People that really insist on the pattern of server-side HTTP sessions as bags of keys inside of serialized blobs on a server should be served by some new library that someone else can maintain. I think I am done with supporting this pattern and I think Ben may be as well.

The theme I've been pushing in my recent talk on SQLAlchemy and I think I'm pushing here, is that if you're writing a big application, one which you'll be spending months or years with, it is very much worth it to spend some time spending a day or so building up foundational patterns. I really have never understood the point of, "I just got the whole app up and running from start to finish on the plane!" Well great, you got the app up in three hours, now you can spend the next six months reworking the whole thing to actually scale (or if it was only a proof-of-concept, then why does it need a cache or even a database?). Did three hours for a quick and dirty implementation versus three days to do it right really improve your life within the scope of the total time spent? Breaking up Beaker into components and encouraging people to use patterns instead of relying upon opaque, pushbutton libraries is part of that idea.