Thoughts on Beaker

October 01, 2011 at 12:01 PM | Code, Mako/Pylons

Beaker is a widely used caching and HTTP session library published by Ben Bangert. It dates back to the early days of Pylons, when Pylons moved off of the Myghty base to the Python Paste infrastructure. Beaker's guts in fact were originally the internals to the caching and session system of Myghty itself, which in turn was loosely based on the caching code from HTML::Mason (which if those two colons don't look familiar means we've made the jump across the ages to Perl).

The key neato thing nestled deep inside of Beaker is what I later was told is called a "dogpile lock". This is a simple system, though slightly less simple than dict.get(), whereby the cache can put one thread to work on generating a new value for a certain cache key, while all the other threads, and even other processes, continue to return the expired value until the new value is ready. I spent weeks getting this aspect to do what it was supposed to. It does things in a slightly more complex but more comprehensive way than Mason did; it coordinates workers using a mutex and/or lockfiles, instead of a simple counter that estimates the creation time for a new value.
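To make the dogpile idea concrete, here is a minimal sketch of it for threads inside a single process. It is not Beaker's or Dogpile's actual code: the class name DogpileCache, its methods, and the injectable clock argument are all made up for illustration, and the cross-process coordination via lockfiles is left out entirely.

```python
import threading
import time


class DogpileCache:
    """Hypothetical sketch: one thread regenerates, the others serve stale."""

    def __init__(self, expiration_seconds, clock=time.monotonic):
        self.expiration_seconds = expiration_seconds
        self._clock = clock          # injectable so expiry can be simulated
        self._values = {}            # key -> (value, created_at)
        self._locks = {}             # key -> threading.Lock
        self._locks_guard = threading.Lock()

    def _lock_for(self, key):
        # One regeneration lock per cache key.
        with self._locks_guard:
            return self._locks.setdefault(key, threading.Lock())

    def _fresh(self, entry):
        return (entry is not None
                and self._clock() - entry[1] < self.expiration_seconds)

    def get_or_create(self, key, creator):
        entry = self._values.get(key)
        if self._fresh(entry):
            return entry[0]

        lock = self._lock_for(key)
        if entry is None:
            # Nothing to serve yet: everybody waits for the first value.
            lock.acquire()
        elif not lock.acquire(blocking=False):
            # Someone else is already regenerating: return the expired value.
            return entry[0]
        try:
            # Re-check, since another thread may have finished just before us.
            entry = self._values.get(key)
            if self._fresh(entry):
                return entry[0]
            value = creator()
            self._values[key] = (value, self._clock())
            return value
        finally:
            lock.release()
```

Something like cache.get_or_create("front_page", render_front_page) would then call the (hypothetical) render_front_page function at most once per expiration period, no matter how many threads ask at the same moment.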

The storage backends are implemented in terms of a "container", which deals with an individual key, including checking if it's expired, regenerating it, and so forth, and a "namespace", which is the entryway to storing the data. It deals with keys and values like everything else does, and has backends for pickled files, DBM files, relational databases, memcached, and GAE. The general idea here came from Myghty but has been highly modified since then.

Beaker then adds something of a coarse-grained facade over this model, including something of a generic get/put API as well as a system of function decorators that cache the return value of a function. In recent releases it also supports the notion of a "region", which is just a way to package up a particular cache configuration under a single name that can later be referenced by high level caching operations.

Beaker also implements an HTTP session object on top of the same storage backends, taking this from Myghty as well. Implementing HTTP sessions on top of the cache backends was a completely off-the-cuff idea in Myghty, and over the years in Beaker it's had to be wrangled and re-wrangled over basic issues resulting from this mismatch.

As far as the backends themselves, at this point all of them have accumulated a fair degree of woe. The file- and DBM-based backends perform pretty poorly (DBM maybe not as much). The memcached backend has always been problematic - we slowly learned that the cmemcache library is basically a non-starter, and that pylibmc is dramatically faster than the usual memcache library. The memcached backend slowly sprouted an awkward way to select among these client libraries and more awkwardness to attempt to accommodate pylibmc's special concurrency API. Then when users started combining the memcached backend with HTTP sessions, all kinds of new issues occurred, namely that any particular key might be removed from memcached at any time. That caused us to rework the session implementation even more awkwardly to store the whole session under one "key", defeating the purpose of how the session object works with other backends. As for the SQLAlchemy backends, there are two, and I've no idea why.

The HTTP session as a "bag of random keys", to paraphrase a term I think PJE originally used, is not something I really do these days, as I'm pretty skeptical of the approach of storing a bag of key/value pairs inside of a big serialized BLOB as a solution to anything, really. A few keys in a cookie-based session of very limited size is fine. But a backend-oriented system where you're loading up a giant bag of dozens of keys referencing full object structures inside of a good sized serialized blob inside of a file, database, or memcached-type of system, is both horribly inefficient and a sign of sloppy, ad-hoc coding.

Beaker eventually provided a purely client side encrypted cookie session as an alternative to the server-based sessions. This session implementation stores all the data in an encrypted string (Ben spent a lot of time and got a lot of advice getting the encryption right here) which is stored completely on the client. All of the performance, scalability and failover issues introduced by typical memory or file based HTTP session implementations are solved immediately. So this session implementation, I use lots. I only put a few critical keys in it to track the user, but anything that is more significant is part of the object model which is persisted in the database as a first class object.

The only downside to the cookie-only session is that, unless you coordinate the state of the session with server-side state (which IMHO you should), you can't "force" it to expire sooner than it normally would, or otherwise prevent the replay of previous state, within an attack scenario. While an attacker can't craft a session, (s)he can re-use the same cookie over and over again, and that can only be guarded against by comparing its state to that of a memo on the server; a single "generation" counter can achieve this. I.e. session comes in, generation is "5", the server says it should be "7", reject it. So it may be considered that this turns the encrypted cookie session into a fancy session cookie for a server-side session. But I am actually OK with that. I put very, very little into unstructured sessions.
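The generation check can be sketched roughly as follows. This is an illustration only, not Beaker's implementation: an HMAC signature stands in for the real encryption (the payload here is readable by the client, just not forgeable), and SECRET, current_generation, issue_cookie and load_cookie are all hypothetical names.

```python
import hashlib
import hmac
import json

# Assumption: a secret known only to the server.
SECRET = b"hypothetical-server-secret"

# Server-side memo: user id -> the only generation the server still accepts.
current_generation = {}


def _sign(payload):
    return hmac.new(SECRET, payload.encode("utf-8"), hashlib.sha256).hexdigest()


def issue_cookie(user_id):
    """Bump the user's generation and return a signed cookie carrying it."""
    generation = current_generation.get(user_id, 0) + 1
    current_generation[user_id] = generation
    payload = json.dumps(
        {"user_id": user_id, "generation": generation}, sort_keys=True)
    return payload + "|" + _sign(payload)


def load_cookie(cookie):
    """Return the session dict, or None if the cookie is forged or replayed."""
    payload, sep, signature = cookie.rpartition("|")
    if not sep:
        return None
    expected = _sign(payload)
    if not hmac.compare_digest(signature.encode("utf-8"),
                               expected.encode("utf-8")):
        return None   # the attacker tried to craft or alter a session
    data = json.loads(payload)
    if data["generation"] != current_generation.get(data["user_id"]):
        return None   # valid signature, but an old generation: a replay
    return data
```

Issuing a new cookie (on login, logout, or any other state change worth protecting) invalidates every earlier one, which is exactly the "fancy session cookie" trade-off: the client carries the state, the server carries one small integer per user.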

What I'm getting at here is that I'm not super-interested in most of what Beaker has - a lot of the APIs are awkward, all of the storage backends are either mostly useless or slow and crufty, and I don't care much for the server-side session thing especially using backends that were designed for caching. Overall Beaker creates a coarse opacity on top of a collection of things that I think good application developers should be much more aware of. I really think developers should know how their tools work.

Beaker has some isolated features which I think are great. These are the dogpile lock, the encrypted client-side cookie session, the concept of "cache regions" whereby a set of cache configuration is referencable by a single name, and some nice function decorators - these allow you to apply caching to a function, similarly to:

@cached(region="file_based", namespace="somenamespace")
def return_some_data(x, y, z):
    # ...
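For readers who want to see the moving parts, a decorator along these lines can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not Beaker's API: the regions dictionary, the in-memory dict standing in for a real backend, and this particular cached() signature are all invented for illustration, and there is no expiration or dogpile handling.

```python
import functools

# Hypothetical "regions": each name packages up one cache configuration.
# Plain dicts stand in for real storage backends here.
regions = {
    "file_based": {"store": {}},
    "short_term": {"store": {}},
}


def cached(region, namespace):
    """Cache a function's return value in the named region."""
    store = regions[region]["store"]

    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args):
            # Positional arguments only, and they must be hashable.
            key = (namespace, fn.__name__, args)
            if key not in store:
                store[key] = fn(*args)
            return store[key]
        return wrapper
    return decorate


@cached(region="file_based", namespace="somenamespace")
def return_some_data(x, y, z):
    # Stand-in for an expensive computation.
    return x + y + z
```

The point of the region name is that the function only says which configuration it wants; where and for how long the data is stored is decided in one place.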

I'm not sure what's in store for Beaker. But what I'm doing is:

  • Created Dogpile, which is just the "dogpile lock" portion of Beaker, cleaned up and turned into a very clear, succinct system which you can use to build your own caching implementation. The README includes a full example using Pylibmc.
  • In Mako, for either 0.5.1 or 0.6.0 (not sure yet, though I just put out 0.5.0 the other day for unrelated reasons) the built-in Beaker support is genericized into a plugin system. You can write your own plugins very easily for Mako using Dogpile or similar, if you want caching inside your templates that isn't based off of Beaker. The system supports entrypoints so if a Beaker2 comes out, it would publish a cache backend that Mako could then use.

So we have that. Then, for the community I would like to see:

  • The solution for the key/value backend (or not). Today, high-performance, sophisticated key/value stores are ubiquitous. We of course have memcached. We also have Riak and Redis, which appear very feasible for temporary storage; I'm not sure if Tokyo Tyrant is still popular. And of course there are the more permanent-oriented systems like Mongo and Cassandra. I am sure that someone has built some kind of generic facade over these, and there are probably many. These are the key/value systems you should be using, either via a modern facade or one of the libraries directly; Beaker's system, which was originally built to store keys inside of dictionaries or pickled files, is absolutely nothing more than a quaint, historical novelty in comparison.
  • A new library that is just the encrypted cookie session. I'd use this. It would be very nice for this portion of Beaker to be its own thing. Until then I may just rip out that part of Beaker and stick it in a lib/ folder somewhere in my work projects for internal use.
  • For server-side HTTP sessions, I really think people should be rolling solutions for this as needed, probably using the encrypted cookie session for client side state linked to first class model objects in the datastore (relational or non-relational, doesn't matter). If accessing the datastore is a performance issue, you'd be using caching for those data structures, the same way you'd cache data structures that are not specific to a user (probably based on primary key or on their originating query). In the end this is similar to HTTP sessions stored directly in the cache backend, except there's a well defined model layer in between.
  • People that really insist on the pattern of server-side HTTP sessions as bags of keys inside of serialized blobs on a server should be served by some new library that someone else can maintain. I think I am done with supporting this pattern and I think Ben may be as well.

The theme I've been pushing in my recent talk on SQLAlchemy, and I think I'm pushing here, is that if you're writing a big application, one which you'll be spending months or years with, it is very much worth it to spend a day or so building up foundational patterns. I really have never understood the point of, "I just got the whole app up and running from start to finish on the plane!" Well great, you got the app up in three hours, now you can spend the next six months reworking the whole thing to actually scale (or if it was only a proof-of-concept, then why does it need a cache or even a database?). Did three hours for a quick and dirty implementation versus three days to do it right really improve your life within the scope of the total time spent? Breaking up Beaker into components and encouraging people to use patterns instead of relying upon opaque, pushbutton libraries is part of that idea.


SQLAlchemy - an Architectural Retrospective

September 25, 2011 at 12:01 PM | Code, SQLAlchemy, Talks

The video of my PyGotham talk, SQLAlchemy, an Architectural Retrospective is now available at pyvideo. This talk details my current thinking on SQLAlchemy philosophy and then proceeds through architectural overviews of several key areas, including the core operation of the Unit of Work. The accompanying slides are available in PDF form as well. Enjoy !


SQLAlchemy at PyGotham

September 16, 2011 at 12:01 PM | Code, SQLAlchemy, Talks


ORM Thoughts (in < 140 x 5 characters)

June 16, 2011 at 03:29 PM | Code, SQLAlchemy

When I wrote SQLAlchemy, a key intent was as a working rebuttal to the "ORM is vietnam" argument. (edit: for those who don't know where that phrase originates, it starts with this very famous post.)

This was achieved by: a. tweaking the key assumption that relational concepts must be hidden. They don't. Practicality beats purity.

And b. not underestimating the task, i.e. unit of work, cached collections and associations backed by eager loading (solving N+1 problem).

Hibernate does both, but suffers from the limitations of Java - complex, heavyhanded. Python is what makes SQLAlchemy possible.

Our users can now write super-SQL-fluent apps succinctly and quickly; that's all the proof one should need that ORM is definitely worth it.


Magic, a "New" ORM

May 17, 2011 at 07:36 PM | Code, SQLAlchemy

TL;DR - Use SQLAlchemy to create your own Magic.

It's new, and easy! That's why we call it what it is: Magic. A new ORM that keeps things simple. Let's dive in !

from magic import (
            Entity, one_to_many, many_to_one, many_to_many, string
        )

I like this so far ! What does a new model class look like ?

class Parent(Entity):
    children = one_to_many("Child", "child_id",
                            reverse="parent")

No. Way. That's it ? No tables and columns ? No foreign thingies ? Where's the meaningless boilerplate ?

class Child(Entity):
    parent = many_to_one("Parent", "child_id",
                            reverse="children")

    tags = many_to_many("Tag", "child_tag",
                                "child_id",
                                "tag_id")

class Tag(Entity):
    name = string(50)

OK that's a little more chatty but seriously zzzeek, don't you want some weird "==" signs in there ?

Entity.setup_database("sqlite://", create=True)

This is beginning to remind me of washing machines that also have a dryer built inside of them, or those TVs that have VCRs embedded inside the case.

t1, t2, t3 = Tag(name='t1'), Tag(name='t2'), Tag(name='t3')
Entity.session.add(Parent(
        children={
            Child(tags={t1, t2}),
            Child(tags={t1, t3}),
            Child()
        }))
Entity.session.commit()

p1 = Entity.session.query(Parent).first()
for child in p1.children:
    print child, [t.name for t in child.tags]

New-style sets! Hooray for Python.

And...that's it. Magic!

Would you want this ORM ? Or would you want a different one ? Well how does Magic work ? I'm pretty sure you can guess how it starts:

from sqlalchemy import (
            Column, ForeignKey, Table,
            Integer, String, create_engine
        )

There's the zzzeek we know ! Blah blah blah tables, constraints, boring things. Well we might as well get on with it:

from sqlalchemy.orm import (
            class_mapper, mapper, relationship,
            scoped_session, sessionmaker, configure_mappers
        )
from sqlalchemy.ext.declarative import declared_attr, declarative_base
from sqlalchemy import event
import re

I like how "re" is the honored guest of all that SQLAlchemy stuff.

@event.listens_for(mapper, "mapper_configured")
def _setup_deferred_properties(mapper, class_):
    """Listen for finished mappers and apply DeferredProp
    configurations."""

    # Copy the items first: _config() may add attributes to this class.
    for key, value in list(class_.__dict__.items()):
        if isinstance(value, DeferredProp):
            value._config(class_, key)

And here we have our first docstring. What is this "event" you speak of ?

zzzeek says: That's the new thing in 0.7 ! You're going to get a lot of mileage out of it - everything that used to be extension this, listener that, all goes through event. And there's lots of new events added with more on the way.

For this one in particular, just like it says, anytime a new mapper appears, this thing is going to run and....work all the magic.

Well it was nice while it lasted, I guess it's about to get ugly huh.

Deep breath, just a slight pinch:

class DeferredProp(object):
    """A class attribute that generates a mapped attribute
    after mappers are configured."""

    def _setup_reverse(self, key, rel, target_cls):
        """Setup bidirectional behavior between two relationships."""

        reverse = self.kw.get('reverse')
        if reverse:
            reverse_attr = getattr(target_cls, reverse)
            if not isinstance(reverse_attr, DeferredProp):
                reverse_attr.property._add_reverse_property(key)
                rel._add_reverse_property(reverse)

class FKRelationship(DeferredProp):
    """Generates a one to many or many to one relationship."""

    def __init__(self, target, fk_col, **kw):
        self.target = target
        self.fk_col = fk_col
        self.kw = kw

    def _config(self, cls, key):
        """Create a Column with ForeignKey as well as a relationship()."""

        target_cls = cls._decl_class_registry[self.target]

        pk_target, fk_target = self._get_pk_fk(cls, target_cls)
        pk_table = pk_target.__table__
        pk_col = list(pk_table.primary_key)[0]

        if hasattr(fk_target, self.fk_col):
            fk_col = getattr(fk_target, self.fk_col)
        else:
            fk_col = Column(self.fk_col, pk_col.type, ForeignKey(pk_col))
            setattr(fk_target, self.fk_col, fk_col)

        rel = relationship(target_cls,
                primaryjoin=fk_col==pk_col,
                collection_class=self.kw.get('collection_class', set)
            )
        setattr(cls, key, rel)
        self._setup_reverse(key, rel, target_cls)

class one_to_many(FKRelationship):
    """Generates a one to many relationship."""

    def _get_pk_fk(self, cls, target_cls):
        return cls, target_cls

class many_to_one(FKRelationship):
    """Generates a many to one relationship."""

    def _get_pk_fk(self, cls, target_cls):
        return target_cls, cls

class many_to_many(DeferredProp):
    """Generates a many to many relationship."""

    def __init__(self, target, tablename, local, remote, **kw):
        self.target = target
        self.tablename = tablename
        self.local = local
        self.remote = remote
        self.kw = kw

    def _config(self, cls, key):
        """Create an association table between parent/target
        as well as a relationship()."""

        target_cls = cls._decl_class_registry[self.target]
        local_pk = list(cls.__table__.primary_key)[0]
        target_pk = list(target_cls.__table__.primary_key)[0]

        t = Table(
                self.tablename,
                cls.metadata,
                Column(self.local, ForeignKey(local_pk), primary_key=True),
                Column(self.remote, ForeignKey(target_pk), primary_key=True),
                keep_existing=True
            )
        rel = relationship(target_cls,
                secondary=t,
                collection_class=self.kw.get('collection_class', set)
            )
        setattr(cls, key, rel)
        self._setup_reverse(key, rel, target_cls)

That was highly unpleasant. Please don't paste that much code again.

zzzeek says: OK! It's just doing the foreign key and relationship() for us. If you've worked with straight SQLAlchemy before, most of what's in there shouldn't be too mysterious.

We're getting the "target" of the relationship using _decl_class_registry, a dictionary that gives us the target class based on the string, which is put there by Declarative. We're looking at the existing classes and their __table__ to get at the appropriate primary key (assumed to be non-composite.... a little more magic could certainly improve upon that though!), we create a Column() with ForeignKey() the way you'd normally be doing for all your mapped classes individually, or in the case of many-to-many we just put two of them into a Table. Then we send out a relationship() with what we've come up with. We can stick these attributes right on the classes and Declarative takes care of making sure they are mapped and such.

It's an entirely alternate form of relationship in just 80 lines - there's lots of ways to play with things like this. I personally don't need this much re-working of SQLAlchemy's usual relationship() syntax, and I think most of our users don't either - but the job first and foremost of relationship() is to have awesome functionality. I've seen some requests sometimes to make it do things like this, and one of our goals is to make whatever customizations people need as doable as possible. Patterns like these can change how the rest of your project looks. That one is pretty ambitious - but there's plenty of others that are a lot simpler, and can really cut down on noise throughout the bulk of mapping code:

def string(size):
    """Convenience macro, return a Column with String."""

    return Column(String(size))

def int():
    """Convenience macro, return a Column with Integer."""

    return Column(Integer)

Why thank you !

class Base(object):
    """Base class which auto-generates tablename, surrogate
    primary key column.

    Also includes a scoped session and a database generator.

    """
    @declared_attr
    def __tablename__(cls):
        """Convert CamelCase class name to underscores_between_words
        table name."""
        name = cls.__name__
        return (
            name[0].lower() +
            re.sub(r'([A-Z])', lambda m:"_" + m.group(0).lower(), name[1:])
        )

    id = Column(Integer, primary_key=True)
    """Surrogate 'id' primary key column."""

    @classmethod
    def setup_database(cls, url, create=False, echo=False):
        """'Setup everything' method for the ultra lazy."""

        configure_mappers()
        e = create_engine(url, echo=echo)
        if create:
            cls.metadata.create_all(e)
        cls.session = scoped_session(sessionmaker(e))

Entity = declarative_base(cls=Base)

Well now ! Why didn't you tell us you could do that before ? I've been putting __tablename__ and columns all over the place.

zzzeek says: We get into it to a good degree when we talk about "mixins" here; most of what mixins do can go on your "base" as well.
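If you're curious what table names that __tablename__ rule actually produces, the same expression can be tried on its own, without SQLAlchemy. The function name tablename_for is made up for this sketch; the body is the expression from the Base class above.

```python
import re


def tablename_for(class_name):
    """Convert a CamelCase class name to underscores_between_words."""
    return (
        class_name[0].lower() +
        re.sub(r'([A-Z])', lambda m: "_" + m.group(0).lower(), class_name[1:])
    )


print(tablename_for("Parent"))       # parent
print(tablename_for("ChildTag"))     # child_tag
print(tablename_for("HTTPSession"))  # h_t_t_p_session
```

The last case shows the one rough edge: every capital letter gets its own underscore, so acronyms in class names come out rather chatty. A little more magic could certainly improve upon that too.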

Alrighty. Short blog post today?

zzzeek says: Indeed. The moral of the story is, SQLAlchemy isn't a framework, and never was...it's a toolkit - you should build things !

Look for SQLAlchemy 0.7's production release soon, in the meantime here's some magic.