Value Agnostic Types, Part II

October 29, 2011 at 12:01 PM | Code, SQLAlchemy

In last week's post I illustrated how to use a "value agnostic type" in conjunction with an ORM-mapped class, using the @hybrid_property decorator. Since I was pressed for details about the hypothetically better TypeDecorator approach, we'll illustrate that here. The TypeDecorator approach, as I've mentioned, has a few non-intuitive edges, though after working out the example they're not so bad. I do in fact use the TypeDecorator approach in my day job as often as I use the pure hybrid approach.

On SQL Objects and Types

The aspect of the approach I'm a little unsatisfied with is that SQLAlchemy currently doesn't provide a first class pathway to linking operators with the types they operate upon. This is a situation I'd like to improve in an upcoming release, but let's see what the general idea is.

SQLAlchemy separates the concerns of a "SQL-representation" object and a "type-representation" object. That is, when you deal with a Column associated with a table, it is distinct from its type:

someint = Column('int_col', Integer)
somestring = Column('str_col', String)

where above, we have two Column objects. Each has a type attribute, which refers to the SQL type of the column. Above, these are illustrated as Integer and String.

I've seen other systems that do things more directly:

someint = Integer("int_col")
somestring = String("str_col")

That approach is at first simpler, since you can now put special operators upon Integer and String specific to those types, thereby linking Python's typing behavior directly to that of the database. However this fails to model how the relational database actually works. Suppose we derive new SQL constructs from someint and somestring, a "label" and a "scalar subquery", respectively:

label_of_someint = someint.label("some label")
subquery = select([somestring]).as_scalar()

Now what's the type-dependent behavior of those two objects? In this case, the rationale for SQLAlchemy's approach is clear - each has a type attribute copied straight from the origin expression, which applies the same behavior to the derived construct as the original. On the other hand, the "expression-is-the-type" system would in theory require IntegerLabel, StringLabel, IntegerSubquery, StringSubquery classes - an explosion of subclasses and behaviors awkwardly linked together. Welding the datatype to the SQL expression type doesn't really work.

SQLAlchemy's approach is more comprehensive, but at the moment only has limited hooks into the type when an expression is generated. When an expression is constructed, the type is consulted for a few specific things, including coercion of the operator, such as coercing the add() operator to the concat() operator in the case of strings, as well as hints on what the return type of the expression should be, and what kind of type the "other" side should be treated as, if not already known. It doesn't yet have a way to add new operators to the SQL construct which contains the type, or to replace a value with a SQL expression - something that would in particular help a lot with things like PostGIS. These problems are entirely solvable, however, and solutions may arrive in a future release.
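To make that operator coercion concrete, here's a small standalone sketch, separate from the running example:

from sqlalchemy import Integer, String
from sqlalchemy.sql import column

someint = column('int_col', Integer)
somestring = column('str_col', String)

# the Integer type leaves the add() operator alone:
print someint + 5          # int_col + :int_col_1

# the String type coerces add() into concat(), rendered as || by default:
print somestring + "abc"   # str_col || :str_col_1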

As it turns out, these limitations aren't terrible in the case of our Amount type, and we'll use TypeDecorator to create a new "currency" type that works just as well without the ORM, that is, with plain SQL expressions, as with it. ORM-mapped classes will no longer need to use @hybrid_property to coerce columns into the type. And we'll even get Amount objects back directly in result sets. Overall, after I tried this out, I realized it's definitely the better way to go.

Reintroducing the Amount type

We'll reuse the Amount type from last week. It's mostly the same, except we'll remove __clause_element__(), and additionally provide a classmethod version of the as_currency() method, which we'll use when dealing with SQL expressions. Because a SQLAlchemy custom type can't export new operations to the enclosing SQL construct, special operations on currency types will be done via this function, called externally against the target, rather than via a method call on the SQL expression construct itself:

from sqlalchemy.ext.hybrid import hybrid_method
from sqlalchemy import type_coerce, Numeric

class Amount(object):
    def __init__(self, amount, currency):
        self.currency = currency
        self.amount = amount

    def __add__(self, other):
        return Amount(
                self.amount +
                other.as_currency(self.currency).amount,
                self.currency
            )

    def __sub__(self, other):
        return Amount(
                self.amount -
                other.as_currency(self.currency).amount,
                self.currency
            )

    def __lt__(self, other):
        return self.amount < other.as_currency(self.currency).amount

    def __gt__(self, other):
        return self.amount > other.as_currency(self.currency).amount

    def __eq__(self, other):
        return self.amount == other.as_currency(self.currency).amount

    @hybrid_method
    def as_currency(self, other_currency):
        # instance level: convert this Amount's numeric value
        return Amount(
                    currency_lookup[(self.currency, other_currency)] * self.amount,
                    other_currency)

    @as_currency.expression
    def as_currency(cls, target, other_currency):
        # class level: "target" is a Column or ORM attribute whose
        # type is our AmountType
        if hasattr(target, '__clause_element__'):
            target = target.__clause_element__()
        currency = target.type.currency
        return Amount(
                    currency_lookup[(currency, other_currency)]
                     * type_coerce(target, Numeric),
                    other_currency)

    def __str__(self):
        return "%s %s" % (self.amount, self.currency)

    def __repr__(self):
        return "Amount(%s)" % self

The basic thing we've added above is a class level version of as_currency(), which accepts a target and a currency. This would be used like:

Amount.as_currency(mytable.c.balance, "usd")

This method needs to take a peek at what it gets to see if it's directly a SQL expression, or an ORM-oriented expression that needs the __clause_element__() method to resolve down to the SQL expression. Consider this to be "rough edge" number one, if you will, though there's not much to it.

Another special step in that method is that we're using type_coerce(target, Numeric) to coerce the target, which will be a SQL expression using our new TypeDecorator, so that it's treated as a Numeric when multiplying by the exchange rate. If we didn't do that, the type coercion rule used by our type would take effect and assume the exchange rate is also an Amount type, which is not the case. While that coercion rule can be customized to handle an incoming Decimal value, it's more appropriately left as is, since expressions against Amount should always be using another Amount object.
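To see what type_coerce() is doing in isolation, here's a hedged sketch; the loose Numeric column below just stands in for one that would use AmountType:

from decimal import Decimal
from sqlalchemy import Numeric, type_coerce
from sqlalchemy.sql import column

balance = column('balance', Numeric)   # stand-in for an AmountType("usd") column

# type_coerce() tells the expression to use plain Numeric rules, so the
# Decimal exchange rate becomes an ordinary bound parameter instead of
# being run through our custom type's coercion logic
expr = Decimal("1.01152") * type_coerce(balance, Numeric)
print expr   # :param_1 * balance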

Oddly enough we used @hybrid_method to provide the "dual purposed" behavior of as_currency() at the instance level and at the class level. Nothing to do with the ORM here, just a handy place to use it.

Our TypeDecorator is then pretty much from the docs:

from sqlalchemy import TypeDecorator, Numeric

class AmountType(TypeDecorator):
    impl = Numeric

    def __init__(self, currency):
        self.currency = currency
        super(AmountType, self).__init__()

    def process_bind_param(self, value, dialect):
        # convert an Amount heading into the database to a plain numeric
        # value in this column's currency
        if value is not None:
            value = value.as_currency(self.currency).amount
        return value

    def process_result_value(self, value, dialect):
        # wrap the numeric value coming back from the database in an Amount
        if value is not None:
            value = Amount(value, self.currency)
        return value

Now if we modify our mapping from last week, we can take out the hybrid, and just use AmountType directly:

class BankAccount(Base):
    __tablename__ = 'bank_account'
    id = Column(Integer, primary_key=True)

    balance = Column('balance', AmountType("usd"))

    def __repr__(self):
        return "BankAccount(%s)" % self.balance

And that's it! What changes with this usage, and what stays the same? Most simple things stay the same, such as creating new BankAccount objects:

session.add_all([
    account,
    BankAccount(balance=Amount(10000, "cad")),
    BankAccount(balance=Amount(5700, "usd")),
    BankAccount(balance=Amount(89682, "aud")),
])

as well as basic comparison operations:

>>> print session.query(BankAccount).\
...    filter(BankAccount.balance == Amount(10000, "cad")).one()
SELECT bank_account.id AS bank_account_id,
        bank_account.balance AS bank_account_balance
FROM bank_account
WHERE bank_account.balance = ?
(9886.11,)
BankAccount(9886.1100 usd)

A change is when we want to use an Amount-specific operator on the SQL side, namely as_currency(). We use the new class-level version, which produces a new Amount object against the underlying balance column, coerced to Numeric and multiplied by the exchange rate:

>>> print Amount.as_currency(BankAccount.balance, "cad")
:param_1 * bank_account.balance cad

In other words, "(exchange rate * numeric balance) in Canadian dollars". The returned Amount object maintains the "hybrid" behavior we saw last week:

>>> print session.query(BankAccount).\
...    filter(Amount.as_currency(BankAccount.balance, "cad") > Amount(9999, "cad")).\
...    filter(Amount.as_currency(BankAccount.balance, "cad") < Amount(10001, "cad")).\
...    one()
SELECT bank_account.id AS bank_account_id, bank_account.balance AS bank_account_balance
FROM bank_account
WHERE ? * bank_account.balance > ? AND ? * bank_account.balance < ?
(1.01152, 9999.0, 1.01152, 10001.0)
BankAccount(9886.1100 usd)

With the typed approach, we can finally get Amount objects straight back from a result set:

>>> print session.query(BankAccount.balance).all()
SELECT bank_account.balance AS bank_account_balance FROM bank_account
[(Amount(4000.0000 usd),), (Amount(9886.1100 usd),), (Amount(5700.0000 usd),), (Amount(92338.3808 usd),)]

and it works without the ORM too!

>>> from sqlalchemy import select
>>> bank_account = BankAccount.__table__
>>> for amount, in session.execute(
...            select([bank_account.c.balance]).\
...            where(Amount.as_currency(bank_account.c.balance, "cad") > Amount(9999, "cad"))
...            ).fetchall():
...     print "Amount:", amount, \
...                 "As USD:", amount.as_currency("usd"), \
...                 "As CAD:", amount.as_currency("cad")
SELECT bank_account.balance FROM bank_account
WHERE ? * bank_account.balance > ?
(1.01152, 9999.0)
Amount: 9886.1100 usd As USD: 9886.1100 usd As CAD: 9999.9980 cad
Amount: 92338.3808 usd As USD: 92338.3808 usd As CAD: 93402.1190 cad

Downloads

Download the revised version of account_currency.py:


Hybrids and Value Agnostic Types

October 21, 2011 at 12:01 PM | Code, SQLAlchemy

Update: This is now part one of a two part series. Be sure to read Value Agnostic Types, Part II for an updated approach using TypeDecorator.

A "hybrid" in SQLAlchemy is this awesome idea thought up by Ants Aasma. The hybrid takes advantage of the ability of Python classes to have an attribute which has customizable behavior both at the class level as well as at the instance level. What's that mean? Just this:

class my_descriptor(object):
    def __get__(self, instance, owner):
        if instance is None:
            return "I'm at the class level!"
        else:
            return "I'm at the instance level!"

class MyClass(object):
    some_attribute = my_descriptor()

Above, my_descriptor is what's known as a Python Descriptor. It's a way to have an attribute on an object where the value produced comes from a function. Above, we've made one that will tell you one thing when invoked on the class, i.e. class level behavior:

>>> MyClass.some_attribute
I'm at the class level!

And another at the instance level, i.e. instance level behavior:

>>> my_object = MyClass()
>>> my_object.some_attribute
I'm at the instance level!

That's all we're dealing with here. As it turns out, descriptors are very commonly used in object relational mappers and other "declarative" kinds of systems, where attributes are defined at the class level and maintain behavior on the class, as well as on instances. Suppose we have a SQLAlchemy class like this:

from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, Numeric
Base = declarative_base()

class BankAccount(Base):
    __tablename__ = 'bank_account'
    id = Column(Integer, primary_key=True)

    balance = Column(Numeric)
    pending = Column(Numeric)

We've made a new class BankAccount, which represents something like a bank account. It contains fields representing the current account balance, as well as a "pending" balance - say it's checks that haven't cleared yet.

Let's set up a sqlite database as usual and save some accounts:

from sqlalchemy import create_engine
engine = create_engine('sqlite://', echo=True)
Base.metadata.create_all(engine)

from sqlalchemy.orm import Session
session = Session(engine)
session.add_all([
    BankAccount(balance=4000, pending=500),
    BankAccount(balance=10000, pending=0),
])
session.commit()

Descriptors are used here as follows: first, the BankAccount class, as a result of being mapped by SQLAlchemy, has three descriptors on it, named id, balance, and pending. These descriptors have class- and instance-level behavior. When invoked at the class level, they produce SQL expressions:

>>> print BankAccount.balance + BankAccount.pending
balance + pending
>>> print select([BankAccount.id, BankAccount.balance])
SELECT id, balance FROM bank_account

Invoked at the instance level, they provide access to values that are present in __dict__, the same way any other Python object works:

>>> account = BankAccount()
>>> account.balance = 1000
>>> account.pending = 500
>>> account.__dict__
{'balance': 1000, 'pending': 500, '_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x1014dde50>}

(_sa_instance_state is SQLAlchemy's tracking object for this particular BankAccount). SQLAlchemy's descriptors, used at the instance level, also keep track of changes to data, and can also invoke queries against the database to load more data if needed.

Taking Advantage of the Descriptor with a Hybrid

The main thing to understand about the previous section is that the attributes of BankAccount, that is id, balance, and pending, are ultimately powered by Python functions, which work both on an instance as well as on a class. The first argument passed to these functions is either self, that is, an instance of BankAccount, or cls, that is, the BankAccount class itself. Both self and cls have the same attributes present on them, and in many ways can be treated identically; we often don't have to care that we're dealing with self or cls. Hybrids take advantage of this symmetry and allow us to expand upon it. The core idea of the Hybrid is not as much a SQLAlchemy thing as it is a general Python thing - you can use the same idea with most other ORMs that feature usage of descriptors in this way.
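As a rough, framework-free sketch of that symmetry (hypothetical code, not SQLAlchemy's actual implementation), a "hybrid" descriptor boils down to handing the same function either the class or the instance:

class hybrid(object):
    """A bare-bones 'hybrid' descriptor: one function, two audiences."""
    def __init__(self, func):
        self.func = func

    def __get__(self, instance, owner):
        if instance is None:
            # accessed on the class: hand the class itself to the function
            return self.func(owner)
        # accessed on an instance: behave like a regular read-only property
        return self.func(instance)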

To illustrate, let's give our BankAccount class a new data member called total, representing the balance amount plus the pending amount. When we have a BankAccount loaded in memory, this is simple addition of the existing values, and there's no need to go out to the database for this. We'll use the well known Python descriptor helper called @property so that we can say myaccount.total as though it were a regular attribute:

class BankAccount(Base):
    __tablename__ = 'bank_account'
    id = Column(Integer, primary_key=True)

    balance = Column(Numeric)
    pending = Column(Numeric)

    @property
    def total(self):
        return self.balance + self.pending

With the above, we can get the total amount present in an account:

>>> BankAccount(balance=1000, pending=500).total
1500

Simple enough. But the hybrid can add a new twist to the above, which is that we can transparently calculate the total over on the database side as well. The expression we have:

self.balance + self.pending

can be written against the BankAccount class, using a query, like this:

>>> print session.query(BankAccount.balance + BankAccount.pending)
SELECT bank_account.balance + bank_account.pending AS anon_1
FROM bank_account

The hybrid provides a single descriptor that can emit either an expression against the instance, or against the class, where self might be either an instance of BankAccount, or the BankAccount class:

from sqlalchemy.ext.hybrid import hybrid_property

class BankAccount(Base):
    __tablename__ = 'bank_account'
    id = Column(Integer, primary_key=True)

    balance = Column(Numeric)
    pending = Column(Numeric)

    @hybrid_property
    def total(self):
        return self.balance + self.pending

So we can now do this:

>>> print session.query(BankAccount.total).filter(BankAccount.total.between(3000, 5000)).all()
SELECT bank_account.balance + bank_account.pending AS anon_1
FROM bank_account
WHERE bank_account.balance + bank_account.pending BETWEEN ? AND ?
(3000, 5000)
[(Decimal('4500.0000000000'),)]

We can see that the total attribute works at the SQL level by emitting a SQL expression, and at the instance level like a regular @property:

>>> myaccount = session.query(BankAccount).first()
>>> print myaccount.total
4500.0000000000

While this is a simple example, hybrids have more options for real world cases, particularly the very common case that the SQL expression needs to be defined separately. That, and everything else hybrids can do, are described fully in the document Hybrid Attributes on the SQLAlchemy site.
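As a quick taste of that separately-defined expression form, here is a sketch using a hypothetical CheckingAccount class, not part of the example that follows:

from sqlalchemy import Column, Integer, Numeric, func
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.ext.hybrid import hybrid_property

Base = declarative_base()

class CheckingAccount(Base):
    __tablename__ = 'checking_account'
    id = Column(Integer, primary_key=True)

    balance = Column(Numeric)
    pending = Column(Numeric)

    @hybrid_property
    def total(self):
        # instance level: plain Python addition
        return self.balance + self.pending

    @total.expression
    def total(cls):
        # class level: a distinct SQL expression, here guarding against NULLs
        return func.coalesce(cls.balance, 0) + func.coalesce(cls.pending, 0)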

Value Agnostic Types

Now to introduce a variant on this technique which is not in the docs. The behavior of the hybrid will be offloaded into a new type that's generally not SQLAlchemy specific. We'll apply the technique towards the problem of representing a monetary amount in a currency-agnostic fashion, where exchange rates can be applied relatively transparently. We'll introduce first how we'll work with amounts and exchange rates, starting with a simple lookup dictionary of rates:

from decimal import Decimal

# US dollars, British pounds, Canadian dollar, Euro, Australian dollar
symbols = ('usd', 'gbp', 'cad', 'eur', 'aud')
currency_lookup = dict(
            ((currency_from, currency_to), Decimal(str(rate)))
            for currency_to, values in zip(
                symbols,
                [
                    (1, 1.59009, 0.988611, 1.37979, 1.02962),
                    (0.628895, 1, 0.621732, 0.867748, 0.647525),
                    (1.01152, 1.6084, 1, 1.39569, 1.04148),
                    (0.724743, 1.1524, 0.716489, 1, 0.746213),
                    (0.971228, 1.54434, 0.960166, 1.34009, 1),
                ])
            for currency_from, rate in zip(symbols, values)
    )

The above creates a dictionary currency_lookup which contains exchange rates between pairs of symbols. If we had the value of 5000 in US dollars (USD), and wanted to convert to Australian dollars (AUD), we'd look up the key ("usd", "aud") in the dictionary, giving us a rate to multiply by 5000:

>>> print currency_lookup[("usd", "aud")] * 5000
4856.140000

Next we'll build an interface for this dictionary, an Amount object that represents the combination of a monetary amount, and a particular currency symbol. The Amount object acts like a numeric value, but intercepts Python math and comparison operations to first coerce the "right" side of an equation to use the same currency as the "left" side:

class Amount(object):
    def __init__(self, amount, currency):
        self.currency = currency
        self.amount = amount

    def __add__(self, other):
        return Amount(
                self.amount +
                other.as_currency(self.currency).amount,
                self.currency
            )

    def __sub__(self, other):
        return Amount(
                self.amount -
                other.as_currency(self.currency).amount,
                self.currency
            )

    def __lt__(self, other):
        return self.amount < other.as_currency(self.currency).amount

    def __gt__(self, other):
        return self.amount > other.as_currency(self.currency).amount

    def __eq__(self, other):
        return self.amount == other.as_currency(self.currency).amount

    def as_currency(self, other_currency):
        return Amount(
                    currency_lookup[(self.currency, other_currency)] * self.amount,
                    other_currency)

    def __str__(self):
        return "%2.4f %s" % (self.amount, self.currency)

While we've only implemented a small set of Python comparison and math operations, the Amount class above can handle addition, subtraction, the equals operator, and the less than/greater than comparisons. All coercion is via a single method, as_currency(), which just looks up the exchange rate in the currency_lookup table and converts the "other" side of any equation into a new Amount in the source currency:

>>> # Convert 5000 US dollars into Canadian dollars
>>> print Amount(5000, "usd").as_currency("cad")
5057.6000 cad

>>> # Calculate 5000 USD + 1000 USD:
>>> print Amount(5000, "usd") + Amount(1000, "usd")
6000.0000 usd

>>> # Calculate 5000 USD + 500 GBP:
>>> print Amount(5000, "usd") + Amount(500, "gbp")
5795.0450 usd

>>> # Take the above and convert to CAD:
>>> print (Amount(5000, "usd") + Amount(500, "gbp")).as_currency("cad")
5861.8039 cad

The thing to keep in mind going forward is that, while we're about to do something more interesting with our Amount class, it will always be used as an instance of an Amount - we're not going to map this class at all. What we're instead going to do is take advantage of the hybrid approach and apply it to the argument we pass to Amount. Suppose we passed a SQLAlchemy expression construct to Amount, instead of a numeric value. What would happen?

>>> from sqlalchemy.sql import column
>>> balance_col = column('balance')
>>> print Amount(balance_col, "aud").as_currency("eur").amount
:balance_1 * balance

Passing in a sqlalchemy.sql.column object to Amount instead of a numeric value, we got a SQL expression straight out of the Amount object, which itself had no idea it wasn't handling a number. Note we needed to access .amount in order to display it, as the expression would not have made it through our number formatting Amount.__str__() method. But otherwise, we can pass any type of object into Amount which accommodates basic calculations and it will work fine.

Mapping with Value Agnostic Types

To apply the above Amount object to a mapping, we'll again go back to hybrids. The technique we'll show here uses the object model to coerce between Amount and a raw numeric value. Another option is to use custom types via TypeDecorator - this is detailed in the next post, Value Agnostic Types, Part II. A great advantage of using a hybrid with the value agnostic type is that the whole thing is almost totally orthogonal to SQLAlchemy itself, with very few moving parts or library dependencies.

Let's start up a new BankAccount class; this time, BankAccount will specifically be an account that stores funds in US dollars. We'll store just a balance for now, and the public interface for the balance will be via the Amount object:

from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, Numeric
from sqlalchemy.ext.hybrid import hybrid_property
Base = declarative_base()

class BankAccount(Base):
    __tablename__ = 'bank_account'
    id = Column(Integer, primary_key=True)

    _balance = Column('balance', Numeric)

    @hybrid_property
    def balance(self):
        return Amount(self._balance, "usd")

    @balance.setter
    def balance(self, value):
        self._balance = value.as_currency("usd").amount

    def __repr__(self):
        return "BankAccount(%s)" % self.balance

The balance column is applied to a private attribute _balance, and the public accessor balance uses a hybrid to coerce to/from an Amount object, hardcoding "usd" as the base exchange currency. Given a new BankAccount, we can operate on "balance" at the Python level, using Amount objects as value objects:

>>> account = BankAccount(balance=Amount(4000, "usd"))
>>> print account.balance
4000.0000 usd

>>> print account.balance.as_currency("gbp")
2515.5800 gbp

>>> print account.balance > Amount(500, "cad")
True

>>> print account.balance + Amount(500, "cad") - Amount(50, "eur")
4425.3160 usd

All we're doing here is accessing the .balance attribute on an instance of BankAccount, where the Amount instance returned acts against the underlying value.

But note we've applied @hybrid_property to .balance. That means we want to use it as a SQL expression too. Used in a SQL context, it returns the same Amount object, passing in the ._balance Column object, working as it did earlier. Starting with some test data:

from sqlalchemy import create_engine
from sqlalchemy.orm import Session
from sqlalchemy import func

engine = create_engine('sqlite://', echo=True)
Base.metadata.create_all(engine)

session = Session(engine)

session.add_all([
    account,
    BankAccount(balance=Amount(10000, "cad")),
    BankAccount(balance=Amount(5700, "usd")),
    BankAccount(balance=Amount(89682, "aud")),
])
session.commit()

Above, all the balances are coerced from their source currencies to USD, via the @balance.setter portion of our hybrid, which is using value.as_currency("usd").

We can then query BankAccount objects using the .balance column more or less as is, such as in a simple comparison:

>>> print session.query(BankAccount).\
...    filter(BankAccount.balance == Amount(10000, "cad")).one()
SELECT bank_account.balance AS bank_account_balance, bank_account.id AS bank_account_id
FROM bank_account
WHERE bank_account.balance = ?
(9886.11,)
BankAccount(9886.1100 usd)

Notice above that the value was converted on the Python side before being rendered as a SQL expression. This is because the coercion occurred on the right side of the comparison, an Amount object holding a plain in-Python numeric value.

If we told the left side to convert itself, then we'd be getting the currency conversions on the SQL side as well:

>>> print session.query(BankAccount).\
...    filter(BankAccount.balance.as_currency("cad") > Amount(9999, "cad")).\
...    filter(BankAccount.balance.as_currency("cad") < Amount(10001, "cad")).\
...    one()
SELECT bank_account.balance AS bank_account_balance, bank_account.id AS bank_account_id
FROM bank_account
WHERE ? * bank_account.balance > ? AND ? * bank_account.balance < ?
(1.01152, 9999.0, 1.01152, 10001.0)
BankAccount(9886.1100 usd)

So far the above operations take advantage of the fact that the __eq__(), __lt__() and __gt__() methods of Amount return SQL expressions. To select the converted value itself, at the moment we need to use the .amount accessor:

>>> # print all amounts as CAD
>>> print session.query(BankAccount.balance.as_currency("cad").amount).all()
SELECT ? * bank_account.balance AS anon_1
FROM bank_account
(1.01152,)
[(Decimal('4046.0800000000'),), (Decimal('9999.9979872000'),), (Decimal('5765.6640000000'),), (Decimal('93402.1189872768'),)]

>>> # return the average balance in EUR
>>> print session.query(func.avg(BankAccount.balance.as_currency("eur").amount)).scalar()
SELECT avg(? * bank_account.balance) AS avg_1
FROM bank_account
(0.724743,)
20279.1228162

The situation can be improved in most cases if we add a __clause_element__() method to our Amount class - the full example script includes this. There are also a few other rough edges to this approach which I may try to address in a future release. But overall, I use this pattern quite a bit, in several variations of what's presented here. It in fact stretches all the way back to similar patterns I used when I was primarily a Java programmer, though the Python version is much more flexible. It allows you to forget about certain kinds of numeric conversion details, which now take care of themselves, even in SQL expressions, in a simple and transparent way.
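For reference, a minimal sketch of what that __clause_element__() addition might look like (the full script may differ in its details):

class Amount(object):
    # ... all of the methods from earlier, plus:

    def __clause_element__(self):
        # meaningful when self.amount is itself a SQL expression; lets
        # SQLAlchemy unwrap an Amount passed directly to query(), filter(), etc.
        return self.amount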


Pycon Wants You! By October 12th!

October 03, 2011 at 12:01 PM | Code

The upcoming Pycon US is going to be a total blast. No more snowy Chicago; Atlanta was fun but the hotel was under construction; this year we'll be on the sunny west coast at a convention center! As I'm trying to encourage myself to speak more, I'll likely be doing some version of my SQLAlchemy Retrospective talk, but what Pycon really needs is you - submit a talk or tutorial! I'd personally love to see more talks about using Python in "everyday" kinds of jobs - did you get your boss to let you port some old Java app to Python? Using Python in finance, medicine, insurance, education? We'd (well, at least I would) love to hear about that! Another option is a tutorial - these are three-hour sessions focused on getting people started (or further along) on some of the current technologies. My biased list might include Pyramid, Celery, SQLAlchemy, NLTK - tutorials are paid and are a great way to get your trip to Pycon mostly paid for.

Whatever your talk or tutorial is about, the deadline for a proposal is October 12, 2011. See you in 2012.


Thoughts on Beaker

October 01, 2011 at 12:01 PM | Code, Mako/Pylons

Beaker is a widely used caching and HTTP session library published by Ben Bangert. It dates back to the early days of Pylons, when Pylons moved off of the Myghty base to the Python Paste infrastructure. Beaker's guts in fact were originally the internals to the caching and session system of Myghty itself, which in turn was loosely based on the caching code from HTML::Mason (which if those two colons don't look familiar means we've made the jump across the ages to Perl).

The key neato thing nestled deep inside of Beaker is what I was later told is called a "dogpile lock". This is a simple system - though slightly less simple than dict.get() - whereby the cache can put one thread to work on generating a new value for a certain cache key, while all the other threads, and even other processes, continue to return the expired value until the new value is ready. I spent weeks getting this aspect to do what it was supposed to. It does things in a slightly more complex but more comprehensive way than Mason did; it coordinates workers using a mutex and/or lockfiles, instead of a simple counter that estimates the creation time for a new value.
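A stripped-down, single-process sketch of the idea (hypothetical code; the real implementation also coordinates across processes using lockfiles, which this leaves out):

import time
import threading

EXPIRATION = 30  # seconds
lock = threading.Lock()
value, created_at = None, 0

def expensive_creation():
    # stands in for a slow query, template render, API call, etc.
    time.sleep(2)
    return "fresh value"

def get_value():
    global value, created_at
    if value is not None and time.time() - created_at < EXPIRATION:
        return value
    if lock.acquire(False):
        # this thread won the non-blocking acquire and regenerates the value
        try:
            value = expensive_creation()
            created_at = time.time()
            return value
        finally:
            lock.release()
    elif value is not None:
        # someone else is regenerating; keep serving the expired value
        return value
    else:
        # no value exists at all yet; wait for the creator to finish
        with lock:
            return value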

The storage backends are implemented in terms of a "container", which deals with an individual key, including checking if it's expired, regenerating it, and so forth, and a "namespace", which is the entryway to storing the data. The namespace deals with keys and values like everything else does, and has backends for pickled files, DBM files, relational databases, memcached, and GAE. The general idea here came from Myghty but has been heavily modified since then.

Beaker then adds something of a coarse-grained facade over this model, including something of a generic get/put API as well as a system of function decorators that cache the return value of a function. In recent releases it also supports the notion of a "region", which is just a way to package up a particular cache configuration under a single name that can later be referenced by high level caching operations.

Beaker also implements an HTTP session object on top of the same storage backends, taking this from Myghty as well. Implementing HTTP sessions on top of the cache backends was a completely off-the-cuff idea in Myghty, and over the years in Beaker it's had to be wrangled and re-wrangled over basic issues resulting from this mismatch.

As far as the backends themselves, at this point all of them have accumulated a fair degree of woe. The file- and DBM-based backends perform pretty poorly (DBM maybe not as much). The memcached backend has always been problematic - we slowly learned that the cmemcache library is basically a non-starter, and that pylibmc is dramatically faster than the usual memcache library. The memcached backend slowly sprouted an awkward way to select among these libraries, and more awkwardness to accommodate pylibmc's special concurrency API. Then, when users started combining the memcached backend with HTTP sessions, all kinds of new issues occurred - namely, that any particular key might be removed from memcached at any time - which forced us to rework the session implementation, even more awkwardly, to store the whole session under one "key", defeating the purpose of how the session object works with other backends. As for the SQLAlchemy backends, there are two, and I've no idea why.

The HTTP session as a "bag of random keys", to paraphrase a term I think PJE originally used, is not something I really do these days, as I'm pretty skeptical of the approach of storing a bag of key/value pairs inside of a big serialized BLOB as a solution to anything, really. A few keys in a cookie-based session of very limited size is fine. But a backend-oriented system where you're loading up a giant bag of dozens of keys referencing full object structures inside of a good sized serialized blob inside of a file, database, or memcached-type of system, is both horribly inefficient and a sign of sloppy, ad-hoc coding.

Beaker eventually provided a purely client side encrypted cookie session as an alternative to the server-based sessions. This session implementation stores all the data in an encrypted string (Ben spent a lot of time and got a lot of advice getting the encryption right here) which is stored completely on the client. All of the performance, scalability and failover issues introduced by typical memory or file based HTTP session implementations are solved immediately. So this session implementation, I use lots. I only put a few critical keys in it to track the user, but anything that is more significant is part of the object model which is persisted in the database as a first class object.

The only downside to the cookie-only session is that, unless you coordinate the state of the session with server-side state (which IMHO you should), you can't "force" it to expire sooner than it normally would, or otherwise prevent the replay of previous state, within an attack scenario. While an attacker can't craft a session, (s)he can re-use the same cookie over and over again, and that can only be guarded against by comparing its state to that of a memo on the server; a single "generation" counter can achieve this. I.e. session comes in, generation is "5", the server says it should be "7", reject it. So it may be considered that this turns the encrypted cookie session into a fancy session cookie for a server-side session. But I am actually OK with that. I put very, very little into unstructured sessions.
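A toy sketch of that generation check (hypothetical names; nothing Beaker-specific):

def validate_session(session_data, current_generation):
    # reject any cookie-borne session whose generation is stale
    return session_data.get("generation") == current_generation

# the server bumped this user's generation to 7 (say, after a forced logout);
# a replayed cookie still carrying generation 5 is rejected:
print validate_session({"user_id": 42, "generation": 5}, 7)   # False
print validate_session({"user_id": 42, "generation": 7}, 7)   # True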

What I'm getting at here is that I'm not super-interested in most of what Beaker has - a lot of the APIs are awkward, all of the storage backends are either mostly useless or slow and crufty, and I don't care much for the server-side session thing especially using backends that were designed for caching. Overall Beaker creates a coarse opacity on top of a collection of things that I think good application developers should be much more aware of. I really think developers should know how their tools work.

Beaker has some isolated features which I think are great. These are the dogpile lock, the encrypted client-side cookie session, the concept of "cache regions" whereby a set of cache configuration is referencable by a single name, and some nice function decorators - these allow you to apply caching to a function, similarly to:

@cached(region="file_based", namespace="somenamespace")
def return_some_data(x, y, z):
    # ...

I'm not sure what's in store for Beaker. But what I'm doing is:

  • Created Dogpile, which is just the "dogpile lock" portion of Beaker, cleaned up and turned into a very clear, succinct system which you can use to build your own caching implementation. The README includes a full example using Pylibmc.
  • In Mako, for either 0.5.1 or 0.6.0 (not sure yet, though I just put out 0.5.0 the other day for unrelated reasons) the built-in Beaker support is genericized into a plugin system. You can write your own plugins very easily for Mako using Dogpile or similar, if you want caching inside your templates that isn't based off of Beaker. The system supports entrypoints so if a Beaker2 comes out, it would publish a cache backend that Mako could then use.

So we have that. Then, for the community I would like to see:

  • The solution for the key/value backend (or not). Today, high-performance, sophisticated key/value stores are ubiquitous. We of course have memcached. We also have Riak and Redis which appear very feasible for temporary storage, I'm not sure if Tokyo Tyrant is still popular. And of course there's the more permanent-oriented systems like Mongo and Cassandra. I am sure that someone has built some kind of generic facade over these, and there are probably many. These are the key/value systems you should be using, either via a modern facade or one of the libraries directly; Beaker's system, which was originally built to store keys inside of dictionaries or pickled files, is absolutely nothing more than a quaint, historical novelty in comparison.
  • A new library that is just the encrypted cookie session. I'd use this. It would be very nice for this portion of Beaker to be its own thing. Until then I may just rip out that part of Beaker and stick it in a lib/ folder somewhere in my work projects for internal use.
  • For server-side HTTP sessions, I really think people should be rolling solutions for this as needed, probably using the encrypted cookie session for client side state linked to first class model objects in the datastore (relational or non-relational, doesn't matter). If accessing the datastore is a performance issue, you'd be using caching for those data structures, the same way you'd cache data structures that are not specific to a user (probably based on primary key or on their originating query). In the end this is similar to HTTP sessions stored directly in the cache backend, except there's a well defined model layer in between.
  • People that really insist on the pattern of server-side HTTP sessions as bags of keys inside of serialized blobs on a server should be served by some new library that someone else can maintain. I think I am done with supporting this pattern and I think Ben may be as well.

The theme I've been pushing in my recent talk on SQLAlchemy, and I think I'm pushing here, is that if you're writing a big application, one which you'll be spending months or years with, it is very much worth it to spend some time - a day or so - building up foundational patterns. I really have never understood the point of, "I just got the whole app up and running from start to finish on the plane!" Well great, you got the app up in three hours, now you can spend the next six months reworking the whole thing to actually scale (or if it was only a proof-of-concept, then why does it need a cache or even a database?). Did three hours for a quick and dirty implementation versus three days to do it right really improve your life within the scope of the total time spent? Breaking up Beaker into components and encouraging people to use patterns instead of relying upon opaque, pushbutton libraries is part of that idea.