Revisiting Storm, SQLAlchemy, and Geniusql

December 26, 2007 at 03:00 PM | Code, SQLAlchemy

We're revisiting Robert Brewer's Storm, SQLAlchemy and Geniusql. In that post, Robert set up a series of test suites intended to exercise the execution of basic INSERT, UPDATE and SELECT statements, using the established paradigms of each toolkit: for SQLAlchemy, its SQL expression language and not its ORM; for Geniusql, its expression language; and for Storm, mostly its ORM except for one or two tests. The article observes poor performance mostly on the part of SQLAlchemy compared to the other two, and spends lots of time poking at SQLAlchemy stack traces, making various guesses on why we're so slow, and issuing some fairly damning comments about SA's design. SQLAlchemy responded immediately with some commentary as well as a new round of speed enhancements in response to some of the "cleanup" items Robert mentioned; but we only focused on SQLAlchemy - we didn't get around to digging deeper into each individual platform's tests and behavior to see what really made them different. In this post, we've finally done that, and the results are very enlightening.

Robert's results have always seemed strange to SQLAlchemy users, since they implied a SQLAlchemy that was almost unusably bloated, which is not what's observed in the real world. Robert is an experienced developer, so I didn't have much doubt, and I'm sure others didn't either, that he was careful in constructing his tests and ensuring they were working as expected. As a result, we spent a lot of focus on the SA tests, trying to find places where we might be able to become faster. Nobody, as far as I know, spent any time looking at the other two suites to see why they might be so fast, particularly in the case of Storm, where even its ORM, which by definition should be slower than direct SQL constructs, seemed considerably faster than SA's lower-level SQL construction language...and curiously, that of Geniusql as well.

It turns out that with just a few minor corrections, even SQLAlchemy's previous 0.3 series comes in second place, slightly slower than Geniusql and usually faster than Storm's ORM (remember that the SQLAlchemy tests are against the lower-level expression language, so they should be faster than an ORM). For all of Robert's detailed critique of SQLAlchemy's execution and analysis of its "slowness", it's hard to imagine that he checked whether Storm, as well as his own Geniusql, were actually doing what was expected; a brief look at the actual SQL emitted by each test immediately reveals where most of the extra SQLAlchemy time is being spent - in SQL execution which plainly is not occurring (even though it's assumed to be) in the case of the other two products.

So to set the record straight (better late than never), I've fixed the tests, modified them to run under a common set of fixtures, and fired up the charts and graphs. I've added a fourth suite to the tests, a SQLAlchemy ORM test which duplicates the Storm ORM test almost line by line. The test platform is a MacBook Pro 2.2 GHz Intel Core Duo, using Python 2.5.1, Postgres 8.2.5, and Psycopg2 2.0.6. SQLAlchemy is trunk r3974 (which will become release 0.4.2), Storm is version 0.11, and Geniusql is r248. The actual speed comparisons between these tools are quite boring. How boring ? This boring:

/files/2007/ssgbenches/total_tests_1_thru_6.png

Where above, Storm runs very fast except for test #4, which as it turns out takes up 78% of the suite's total runtime for the two ORMs. The ORMs are generally similar in speed and a brief test of Storm's expression language in test #5 is faster than both SQLAlchemy and Geniusql. Geniusql, which is still alpha and notably does not use any bind parameters, is consistently a little faster than SQLAlchemy EL.

The full test suite and everything else is viewable here. Of note are the unified diffs which compare my version of the tests to Robert's originals, as well as a diff of the Storm ORM test against the new SQLAlchemy ORM test. I've tried to change as little as possible, and the main structural difference is the common benchmark runner.

So what was happening with Robert's tests ? Just two main points:

SQLAlchemy is instructed to issue hundreds of COMMIT statements. Storm is instructed to issue exactly one, Geniusql fails to issue any (even though instructed to).

Robert relied upon SQLAlchemy's "autocommit" feature for all INSERT and UPDATE statements, so a COMMIT is issued to the database after every statement. With Storm, on the other hand, he used the ORM, which does not "autocommit"; you have to call store.commit() explicitly, which he did not. Additionally, he set this flag on Geniusql: db.connections.implicit_trans = True, which is meant to instruct Geniusql to issue COMMIT after each INSERT or UPDATE. But at least in the most current SVN revision of Geniusql, this flag does nothing whatsoever; grepping through PG's SQL logs reveals this. So to compensate for the lack of COMMITs in the other two test suites, I changed the SQLAlchemy engine declaration from this:

db = create_engine(uri)

To this:

db = create_engine(uri, strategy="threadlocal")
db.begin()

Starting a transaction and thereby disabling the COMMIT after each statement allows the SQLAlchemy SQL expression test, even using the old version 0.3, to hug very closely to Geniusql's. This is why SQLAlchemy seemed so slow in the tests which involved issuing INSERTs and UPDATEs - the overhead of many hundreds of transaction commits within Postgres is profoundly significant. All of the code cleanup and optimization we've added between 0.3 and the current 0.4 doesn't approach the time saved by this one small change.
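To make the difference concrete, here's a minimal sketch of the two execution patterns side by side, using the 0.4-era "threadlocal" strategy; the connection URI, table and row count are purely illustrative:

from sqlalchemy import create_engine, MetaData, Table, Column, Integer, String

engine = create_engine('postgres://scott:tiger@localhost/test', strategy='threadlocal')
metadata = MetaData(engine)
animal = Table('animal', metadata,
    Column('id', Integer, primary_key=True),
    Column('name', String(100)))
metadata.create_all()

# autocommit pattern:  with no transaction in progress, each INSERT below is
# followed by its own COMMIT on the Postgres side
for i in xrange(100):
    animal.insert().execute(name='Tick %d' % i)

# transactional pattern:  one BEGIN up front, one COMMIT at the end
engine.begin()
for i in xrange(100):
    animal.insert().execute(name='Tick %d' % i)
engine.commit()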

But we're not done yet; another factor was key in the huge graphs posted for test numbers 2 and 3:

Storm tests two and three issued almost no SQL whatsoever.

As surprising as this seems, Robert's graph illustrating Storm supposedly executing hundreds of INSERT and SELECT statements at unbelievable speeds actually depicts Storm issuing no INSERTs whatsoever in test #2, and no SELECTs whatsoever in test #3. Let's look:

def step_2_insert(self):
    for x in xrange(ITERATIONS):
        store.add(Animal(Species=u'Tick', Name=u'Tick %d' % x, Legs=8))

What's missing there ? There's no flush() ! Adding a single flush() at the end of the iterations allows the items saved to the store to actually be emitted to the database as INSERT statements:

def step_2_insert(self):
    for x in xrange(ITERATIONS):
        store.add(Animal(Species=u'Tick', Name=u'Tick %d' % x, Legs=8))
    store.flush()

And suddenly the test performs in a believable amount of time, 0.12 seconds for 100 iterations instead of 0.015. The reason things seem to "work" without the explicit flush is that Storm issues its 0.12-second flush at the beginning of test #3, so that the data is available for subsequent tests - but because test #3 is meant to take nearly twenty times longer than test #2, this tiny hit at the beginning is unnoticeable, and test #2 gets away with issuing no SQL at all.

For test #3, I have a feeling Robert might have generated his graphs from a more reasonable test, considering that he mentioned using store.find().one(); but in the test as he posted it, he issues finds using Storm's ORM without asserting the results, like this:

WAP = store.find(Zoo, Zoo.Name==u'Wild Animal Park')

As Robert said, "There must be some things that SQLAlchemy is doing that the others are not." Indeed - issuing the expected SELECT statements. The Storm Store, like SQLA's Query, does not evaluate results until called in an iterative context. So the fix is as follows:

WAP = list(store.find(Zoo, Zoo.Name==u'Wild Animal Park'))

And once again, the miraculous speed of 0.02 seconds for 800 full SELECT statements becomes a believable 1.4 seconds.
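The same gotcha exists on the SQLAlchemy side, for what it's worth; here's a rough sketch of the analogous behavior with SQLAlchemy's Query, assuming a hypothetical Zoo mapping and an ORM session:

# building the Query emits no SQL at all; it's just an expression being assembled
q = session.query(Zoo).filter(Zoo.Name == u'Wild Animal Park')

# only iterating it (list(), .all(), a for loop) emits the SELECT and fetches rows
zoos = list(q)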

The actual results for test number two are as follows:

/files/2007/ssgbenches/step_2_insert.png

A particular thing to note about test #2 is that we are dealing with a total time of less than one second, even for 500 iterations. Even though you see those four lines diverging, it's a horse race around the head of a pin; the differences here are not very important compared to other areas.

For number three:

/files/2007/ssgbenches/step_3_Properties.png

This test takes considerably longer, and SQLAlchemy's ORM and Storm's are performing very similar operations, as are SQLAlchemy's expression constructs and Geniusql's.

Let's look at test #4 again. This test involves a battery of about 21 SELECT statements, fully fetched, repeated for a number of iterations; so 100 iterations would be 2100 SELECT statements and result sets fully fetched.

/files/2007/ssgbenches/step_4_Expressions.png

This is the only test where Storm's ORM can possibly increase its efficiency in fetching; however, you can see that SQLAlchemy's ORM needs to perform a similar amount of work as Storm (we've just optimized the crap out of our ORM fetches in recent versions); fetching ORM-mapped objects is a much more complex operation than fetching rows, as the identity of each row needs to be evaluated, a matching already-loaded object must be searched for, and if not found a new object is instantiated and each of its attributes populated. SQLAlchemy's SQL constructs perform similarly to Geniusql, and you can be sure that Storm's SQL constructs are similarly fast.
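To make that concrete, here's a purely conceptual sketch (not SQLAlchemy's or Storm's actual implementation) of the extra per-row work an ORM does when hydrating objects, versus simply returning rows:

def hydrate(rows, cls, colnames, pk_index, identity_map):
    """Turn raw rows into mapped objects, reusing already-loaded instances."""
    for row in rows:
        key = (cls, row[pk_index])           # determine the identity of the row
        obj = identity_map.get(key)          # is a matching object already loaded ?
        if obj is None:
            obj = cls.__new__(cls)           # no: create a blank instance
            for name, value in zip(colnames, row):
                setattr(obj, name, value)    # and populate each attribute
            identity_map[key] = obj
        yield obj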

Test #5, "aggregates", issues four SELECT statements for a number of iterations (i.e. 400 for 100 iterations) using the raw SQL constructs of each platform, ORMs included:

/files/2007/ssgbenches/step_5_Aggregates.png

Storm uses fewer method calls for this operation, but it curiously makes more of a difference in just this test than in other tests where it also has fewer calls. On that topic, here's a graph of comparative function callcounts, as reported by hotshot, for one iteration in each test:

/files/2007/ssgbenches/callcounts.png

This data falls in line with what we'd expect; SQLAlchemy's ORM is more complex than that of Storm and uses more function calls. The SQL expression language obviously uses fewer calls than either ORM in most cases - this graph directly contradicts most of Robert's comments to the effect of "SQLAlchemy (specifically the expression language) is doing much more than the others". Also interesting is that while Geniusql, a very small toolkit only at rev 248, uses fewer calls than SQLAlchemy's expression language, the ultimate speeds of both turn out to be somewhat similar in all tests (at least, more similar than their callcounts would suggest), which implies that the time taken on the database side is a significant factor in overall speed.
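For reference, the callcounts above were gathered with hotshot; a minimal sketch of that kind of measurement looks like this (run_one_iteration is a hypothetical stand-in for a single iteration of a given test):

import hotshot, hotshot.stats

prof = hotshot.Profile('test.prof')
prof.runcall(run_one_iteration)      # hypothetical: one iteration of one test
prof.close()

stats = hotshot.stats.load('test.prof')
print stats.total_calls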

Test number six shows competitive results between all products. This test loads a record, changes it, and reselects it, then repeats the process.

/files/2007/ssgbenches/step_6_Editing.png

Storm makes use here of its ability to issue a quick UPDATE statement and invalidate its ORM cache simultaneously. SQLAlchemy EL clocks comparably to Storm's optimized operation, whereas SQLAlchemy ORM needs to fall back on a slightly more expensive flush() to update the record in the database and the session simultaneously. To address this, SQLAlchemy will be adding a criterion-based session invalidation feature to its Query object in an upcoming release, possibly 0.4.3.
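Roughly speaking, these are the two SQLAlchemy code paths being compared in this test; a sketch only, assuming a hypothetical animals table, Animal mapping, session and some_id value (0.4-era API):

# expression language:  one UPDATE statement, no ORM bookkeeping involved
animals.update(animals.c.id == some_id).execute(Legs=1000000)

# ORM:  load the object, mutate it, then flush() the change back as an UPDATE
animal = session.query(Animal).get(some_id)
animal.Legs = 1000000
session.flush()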

Robert didn't implement test number seven for the ORM, so the head-to-head matchup is below. Nothing too interesting here:

/files/2007/ssgbenches/step_7_Multiview.png

So, admittedly six months too late, I'm hoping this illustrates that there are only slight to moderate differences in performance between these products at the Python execution level. Your app won't grind to a halt with SQLAlchemy, nor perform miraculous volumes of SQL in mere milliseconds with Storm or Geniusql - the data Robert illustrated, as well as his hypotheses to explain those findings, are just inaccurate. All the hypothesizing about connection pool slowness and everything else mentioned, while not insignificant, has never created huge performance issues, and we of course continue to optimize and re-simplify all areas of SQLAlchemy on a constant basis (where the "de-simplification" occurs in response to new behaviors and feature improvements which we add all the time). SQLAlchemy maintains that the best way to get the fastest performance from a database application remains to use the most appropriate SQL at the most appropriate times; that's exactly why SQLAlchemy has a very rich expression language and SQL compiler, and a highly featured and finely tunable ORM which improves with every release.


SQLAlchemy 0.4.0 Released

October 17, 2007 at 05:44 PM | Code, SQLAlchemy

After much development effort and plenty of beta testing, version 0.4 of SQLAlchemy is released. For those who haven't been following along, this release represents several months of refactorings, rewrites, API reorganizations, and re-documenting of virtually every area of SQLAlchemy's functionality. A handful of core developers and dozens of contributors have helped this release to broadly address many areas, including:

  • major code simplifications and optimizations within the SQL expression API, execution system, and ORM
  • rewritten SQL compiler which allows inlined column defaults, deterministic generation of anonymous alias and label names, unicode identifier support, and dialect-sensitive operator generation; it's also nearly twice as fast
  • almost completely rewritten documentation, with two new tutorials
  • a much more powerful and smart SQL expression API
  • a much simplified, consistent, and capable Query object
  • high level ORM query operations, such as aliased joins, self-referential joins, operators like any() and has(), without the need for explicit Table/Alias usage
  • much better integration of transaction control with ORM sessions
  • core ORM support for vertical and horizontal table partitioning across multiple databases
  • vastly improved mapper configurations, including a comprehensive collections API, simple polymorphic inheritance configuration, composite datatypes comprised of multiple columns, "dynamic" class attributes which query and filter subsets of large collections
  • support for SAVEPOINT and two-phase transactions
  • more database support, including Postgres' "UPDATE..RETURNING" syntax, MySQL's "UPDATE..LIMIT" syntax, PG array types, and new dialects for Sybase and MS Access

If you haven't looked at SQLAlchemy lately, this is the release to look at...we've really pulled out all the stops on this one.


Major Speed Enhancements in SQLAlchemy 0.4

August 14, 2007 at 08:12 PM | Code, SQLAlchemy

A few weeks ago, Flávio Codeço Coelho posted a comparison of Python database interfaces, with a test that focused primarily on mass insert speed. While he had good things to say about our performance, I didn't find the results to be very impressive, and not surprisingly - mass insert speed is something we've never really focused on. The idea being, if you have to insert 100,000 records, you'd probably be using a bulk insert tool or raw SQL scripts anyway.

SQLAlchemy's development has historically been focused on the richness of its SQL expression language and the completeness of its object relational mapper. While it was important that applications using these features don't use up all the system's memory and grind to a halt, at the same time we've never had a "speed-crazy" mindset.

Contrast this to my other project, Mako, where speed was the single most important priority; every line of code written was checked against a benchmark suite to measure its impact. The language itself is pretty barebones, and even a little inconvenient in some ways. That effort produced a pretty fast product which, it's rumored, will form the basis of the newly designed reddit.com.

But while the "speed at all costs" mindset is very relevant for a template language, it somehow doesn't have the same urgency in the land of SQL, where if people truly want blazing speed, they use raw SQL. The point of a SQL toolkit and/or an ORM is to enhance the developer experience, to bring richness and ease to common (and not so common) use cases: being able to issue SQL statements with Chinese-language identifiers with all the DBAPI-divergent trickery hidden behind the scenes, or more commonly just to have smooth Unicode support, numeric and date consistency, richly featured default generators, support for database-specific features like Oracle OUT parameters and Psycopg2's server-side cursors, a healthy dose of hooks and options for ORM behavior, and everything else. The sheer volume of behaviors we are shooting to deliver places a burden on the architecture, which constantly has to refactor and adapt in order to accommodate (and SQLAlchemy is nothing if not accommodating). Not to mention it heavily weighs upon the ability of a small group of unpaid volunteers to get around to reviewing, streamlining, and optimizing all of it.

Nevertheless, the 0.4 release is just plain ginormous, and is changing, streamlining, refactoring just about everything. We have more users, more developers, more trac tickets, so a lot of what was always just "potential" is finally becoming a reality. We even have Robert Brewer of Dejavu (and of course CherryPy) writing benchmark tests for us (although, he might just be trying to get Geniusql to kick our butt...in the most friendly way of course). There's a lot of energy right now so I try to get as much production out of that wave as possible while it lasts.

So the particularly egregious slothiness SQLAlchemy exhibited in Flávio's tests (and really, he picked probably the worst thing he could within SA) has been overhauled. An adaptation of his test, which cranks it up by not only inserting but also fully selecting the results, inserts and selects 50,000 rows from an in-memory SQLite database. When run against 0.3, hotshot shows this extremely large number of function calls:

2850532 function calls (2850529 primitive calls) in 19.501 CPU seconds

That's really bad. In 0.4 trunk, we have it down to this:

900525 function calls (900513 primitive calls) in 8.128 CPU seconds

(Edit from the future: SQLAlchemy 0.6 does this test in 300,422 function calls.)

That's 68% fewer function calls to insert and iterate 50,000 rows of three columns each. It's not that the new system is all that blazingly fast; a raw insert of 100,000 rows with SQLite takes 4 seconds on my machine, while with SA it takes 14 seconds. Still, that's a whole lot better than 0.3, which takes 32 seconds. Speedups like this are not due to any brilliant re-architecting; it's just having enough people helping out that we have the resources to address these old, crufty functionalities that got written at 2:30 AM eight months ago.
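For the curious, the adapted test is essentially of this shape (a rough sketch only; the table here is hypothetical, and the real adaptation of Flávio's script differs in its details):

import time
from sqlalchemy import create_engine, MetaData, Table, Column, Integer, String

engine = create_engine('sqlite://')
metadata = MetaData(engine)
data = Table('data', metadata,
    Column('id', Integer, primary_key=True),
    Column('a', String(50)),
    Column('b', String(50)))
metadata.create_all()

start = time.time()
for i in xrange(50000):
    data.insert().execute(a='a%d' % i, b='b%d' % i)       # one INSERT per row
rows = data.select().execute().fetchall()                  # select everything back
print len(rows), "rows inserted and selected in", time.time() - start, "seconds"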

For everything that gets said about SQLAlchemy, we are really just barely out of the "idea" phase, and it was meant to take a long time for this thing to be really great; especially at "homegrown" pace. Our version number being only 40% of 1.0 is quite intentional.


Polymorphic Associations with SQLAlchemy

May 29, 2007 at 09:51 PM | Code, SQLAlchemy

Updated 4/25/2011 - A new series of examples has been committed to the 0.7 branch which illustrates several ways of associating a particular related object with multiple parents - these are modern examples based on Declarative and Declarative Mixins. One of these scripts, an updated polymorphic association example, can be found at the bottom of this post.

The other day Ben Bangert pointed me over to Understanding Polymorphic Associations, a method used by ActiveRecord to allow a certain type of object to be associated with multiple kinds of parent object, along with the question "Can SQLAlchemy do this?".

My initial mental reaction when I hear "can SQLAlchemy do XYZ" is "probably", with the hope that it isn't going to require a total refactoring of internals to make it happen. In this case, it took me a while to figure out exactly what AR is getting at, since my own cognitive dissonance was being heavily influenced by the meaning of "polymorphic" in SA terminology. In SA, the word "polymorphic" usually means "a mapper that loads instances across a class hierarchy in one query". This means if you have a mapper for class Vehicle, and inheriting mappers for Car, Train and Airplane, you can load a list of Car, Train, and Airplane objects from just one Query call (typically via a UNION ALL statement...but some new twists to that idea are coming up in SA 0.4).

The "polymorphic association", on the other hand, while it bears some resemblance to the regular polymorphic union of a class hierarchy, is not really the same, since you're only dealing with a particular association to a single target class from any number of source classes - source classes which don't have anything else to do with each other; i.e. they aren't in any particular inheritance relationship and are probably all persisted in completely different tables. In this way, the polymorphic association has a lot less to do with object inheritance and a lot more to do with aspect-oriented programming (AOP); a particular concept needs to be applied to a divergent set of entities which otherwise are not directly related. Such a concept is referred to as a cross-cutting concern: for example, all the entities in your domain needing to log their changes to a common logging table. In the AR example, an Order and a User object are both shown requiring links to an Address object.

While studying AR's example, the primary thing I was looking for was not the various keywords and directives used to express the object relationship in Ruby; instead, I was looking for how the database tables are laid out. Once I found that, the exact nature of the relationship would immediately become apparent. This is one thing I like about SQL; it cuts through all the abstracted mumbo jumbo and just shows you the raw relationships in the most plainspoken way possible. Like virtually every ORM I've ever looked at, AR's docs were no exception in that they were long on configurational/usage patterns and very brief on illustrating actual SQL. SQLAlchemy is the only ORM I know of, for any platform, which does not take the position that SQL is an "implementation detail". It takes the position that the developer wants to deal with domain objects as well as SQL/relational concepts at equal levels. Does this violate abstraction and/or encapsulation ? Not at all; SQLAlchemy isn't in the business of building an opaque abstraction layer for you. It's just providing the richest zone of overlap between the two idioms that it possibly can, so that you can build the best abstraction for your needs.

So I found AR's primary table for its users/orders/addresses example to be the addresses table. This is mostly a regular-looking table, with some extra columns added to create the association. Let's illustrate the AR example piece by piece by building a copy of it in SA. We'll start with just the concrete tables and relationships, and worry about generalizing a feature out of it later.

We'll start with the standard SQLAlchemy boilerplate for SA examples...import the whole namespace just to save typing, and set up an anonymous connection to an in-memory SQLite database:

from sqlalchemy import *
metadata = BoundMetaData('sqlite://', echo=False)

Next, let's look at their addresses table (I added a surrogate key because I think AR is actually going to put one there):

addresses = Table("addresses", metadata,
    Column('id', Integer, primary_key=True),
    Column('addressable_id', Integer),
    Column('addressable_type', String(50)),
    Column('street', String(100)),
    Column('city', String(50)),
    Column('country', String(50))
    )

A totally normal addresses table, except for the combination of the addressable_id and addressable_type columns. Normally, you'd see a column such as user_id or customer_id with an explicit foreign key constraint to a table like users or customers, so that any number of address rows can be associated with that particular referenced table. But here, addressable_id has no foreign key whatsoever. In this example, that's intentional: many different "source" tables can conceivably be referenced by addressable_id against their surrogate primary key column. It follows that if more than one source table has the same primary key value, it's ambiguous which of those tables is referenced by a particular addressable_id value. Therefore, the addressable_type column qualifies which of those tables we should be looking at. For example, if two tables users and orders both wanted to reference rows in addresses, you'd get this:

users:      user_id   name
            -----------------
              1       bob
              2       ed

orders:     order_id    description
            -------------------------
              1          order #1
              5          order #2

addresses:    id   addressable_id   addressable_type    street
            ----------------------------------------------------------------------
              1          1            'user'           123 anywhere street
              2          1            'user'           345 orchard ave
              3          2            'user'           27 fulton st.
              4          1            'order'          444 park ave.

The first three rows of addresses have an addressable_id which references rows in users; the last row has an addressable_id which references a row in orders. So to get all the address rows that are mapped to user id 1, you'd say:

select * from addresses where addressable_id=1 and addressable_type='user'

If you didn't have the extra criterion for addressable_type, you'd also get address id 4, which is related to an order, not a user.
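Just to tie this back to SQLAlchemy's expression language, here's roughly how that same query would be spelled against the addresses table defined above:

result = select([addresses],
    and_(addresses.c.addressable_id == 1,
         addresses.c.addressable_type == 'user')).execute()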

As a side note, I've been noticing that AR doesn't even support foreign keys out of the box, instead requiring a plugin in order to deal with them. People say this is due to AR's history as a MySQL-oriented ORM, and as we all know, MySQL for many years tried to convince us all that foreign keys were silly. Well, they lost that argument, and it's too bad AR is still trying to make it, but whatever. Even though this schema requires a relationship with an explicit "no FK constraint" in order to function, I'm going to make a guess that AR's history as a non-FK-oriented tool is what led to the adoption of patterns like this in the first place.

Well, your DBA will never take you seriously again, but let's go with it (for now). The "join" condition implied by the "user" query above is:

addressable_id=<associated id> AND addressable_type=<string identifier>

So that we may start constructing SQL expressions the SQLAlchemy way, let's write out the users and orders tables:

users = Table("users", metadata,
    Column('id', Integer, primary_key=True),
    Column('name', String(50), nullable=False)
    )

orders = Table("orders", metadata,
    Column('id', Integer, primary_key=True),
    Column('description', String(50), nullable=False)
)

Each table has a surrogate primary key and just one column of actual data for example's sake. Placeholder classes for both entities:

class User(object):
    pass

class Order(object):
    pass

From what we've seen so far, it won't be possible for an Address object to exist without having both an addressable_type and an addressable_id attribute. SQLAlchemy takes care of the "foreign key" relationship (even though we don't yet have a real "foreign key" set up), but let's add the "type" attribute to the constructor, since we know we'll have to populate that ourselves:

class Address(object):
    def __init__(self, type):
        self.addressable_type = type

Next, we know we're going to have a little bit of code-intensive stuff to set up the relationship to both User and Order (and whatever else), including making it look as much like the AR example as possible, so let's set up a function addressable which will do the work of the "addressable interface". In this function, we want the ability to mark a class as "addressable" using either a one-to-one or a one-to-many relationship (many-to-many is of course totally doable, but it would require a slightly different table setup, so we aren't addressing it here). We also want to define a create_address() method on the source class which will both create a new Address object and properly associate it with the parent instance, and we want to be able to fetch the parent instance given an Address object, so we'll need some backref (bi-directional relationship) action in there too. Additionally, our table has no foreign key constraints on it, yet we need to set up a relationship; this will have to be expressed manually. We can generically define the join condition between any table and the addresses table via:

primaryjoin = and_(
        list(table.primary_key)[0] == addresses.c.addressable_id,
        addresses.c.addressable_type == table.name
)

Above, we take the primary_key attribute of the source table, convert it from an unordered set to a list, and pull the sole column from that list, in order to get at the single primary key column of the source table (we are assuming a single-column primary key for source tables, since that's all that addressable_id can handle referencing). The second part of the criterion relates the addressable_type column to the string name of the table. We're using that only because it's a string name that happens to be conveniently available; substitute other naming schemes according to taste.

The full function then looks like:

def addressable(cls, name, uselist=True):
    """addressable 'interface'."""

    def create_address(self):
        a = Address(table.name)
        if uselist:
            getattr(self, name).append(a)
        else:
            setattr(self, name, a)
        return a

    cls.create_address = create_address

    mapper = class_mapper(cls)
    table = mapper.local_table

    # no constraints.  therefore define constraints in an ad-hoc fashion.
    primaryjoin = and_(
            list(table.primary_key)[0] == addresses.c.addressable_id,
            addresses.c.addressable_type == table.name
    )

    foreign_keys = [addresses.c.addressable_id]

    mapper.add_property(name, relation(
            Address,
            primaryjoin=primaryjoin, uselist=uselist, foreign_keys=foreign_keys,
            backref=backref(
                '_backref_%s' % table.name,
                primaryjoin=list(table.primary_key)[0] == addresses.c.addressable_id,
                foreign_keys=foreign_keys
            )
        )
    )

The addressable function is given the source class, the name under which the Address-referencing scalar or collection attribute should be placed, and the uselist flag indicating whether we are doing one-to-one or one-to-many. The first thing it does is create a create_address() function which is attached to the source class as an instance method; when invoked, this method will instantiate an Address with its proper type string and associate it with the parent using the scalar or list semantics specified. We then retrieve the Mapper and Table associated with the source class and set up a new relation() relating the source class directly to the Address class, adding it to the source mapper under the desired attribute name. We also set up a backref, giving it a somewhat unusual name.

Both the forward and backwards refs use a constructed primary join condition and an explicit "foreign key" column. How can we say there's a foreign key involved ? The tables themselves have no FKs whatsoever. Well, it turns out that mappers and relations don't really care if there are any actual FKs defined for the tables (or if it's even impossible to define any, as is the case here). They only care what columns you'd like to pretend have a foreign key constraint for the purposes of this relationship, if there aren't any for real (or even if there are). So in all cases, each relation added to a source class' mapper should consider the addressable_id column to be the "foreign key" column, at least as far as the relationship between a particular source row and address row is concerned. The same goes for the backwards reference, which is essentially identical except that it's looked at from the reverse direction.

Let's look at the backref. First of all, its join condition is lacking the comparison to the addressable_type column. While it still works if the comparison is present, it's not actually needed in the case of the backreference, since it's invoked from the perspective of a row in the addresses table, and it actually gets in the way of the backref's lazy loader optimizing its reverse load for a parent User or Order already present in the session. I certainly didn't guess this ahead of time; I only know it because I tried with the full condition to start with, and I watched the SQL generated as I ran my test program (this notion of experimenting and observing to see what works was the inspiration behind the word "alchemy", BTW; just don't blow up your kitchen).

The backref itself has a name which is derived from the source table name. This implies that after several source classes have had addressable applied to them, Address will have several mapped properties of the form _backref_<name>, such as:

# get an address
someaddress = session.query(Address).get(5)

# see if it has an associated Order object
order = someaddress._backref_orders

# see if it has an associated User object
user = someaddress._backref_users

That's a pretty unpleasant interface above. We actually don't have to guess what kind of object an Address is related to, because it already has an addressable_type attribute. Let's redefine the Address class with a handy property that will give us the correct associated source instance:

class Address(object):
    def __init__(self, type):
        self.addressable_type = type
    member = property(lambda self: getattr(self, '_backref_%s' % self.addressable_type))

mapper(Address, addresses)

And that, adding in a simple mapper def for Address, is it...as far as our addressable interface goes. One hardwired version of a polymorphic association creator. Let's create mappers for User and Order and use our new toy:

mapper(User, users)
addressable(User, 'addresses', uselist=True)

mapper(Order, orders)
addressable(Order, 'address', uselist=False)

Notice that we didn't do the usual properties={} thing when setting up the Address relation, since the addressable function calls add_property() on the source mapper for us. This is because we needed the attribute name to set up our create_address() method as well, and we only want to have to type the concrete attribute name once. It's not quite as nice as AR's minimalist syntax, but it's a custom job at this point; in a couple of weeks someone will probably stick this feature into Elixir and it'll be just as pleasing all over again. Plus, we're not done with it yet anyway.
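For contrast, without the deferred add_property() call, the equivalent relation for User would normally be configured inline at mapper() time, something like this (shown for illustration only; User is already mapped above, and the backref/create_address() setup is omitted):

mapper(User, users, properties={
    'addresses': relation(Address,
        primaryjoin=and_(users.c.id == addresses.c.addressable_id,
                         addresses.c.addressable_type == 'users'),
        foreign_keys=[addresses.c.addressable_id],
        uselist=True)
})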

So everything is set up; let's fire it up:

metadata.create_all()

and create a session and manipulate some objects (note to any AR users reading this: before you cringe at the verbosity of using Session and Query objects [not to mention the whole Table and mapper() dichotomy], it should be noted that both SA and Elixir provide myriad ways of creating more succinct code than this...we are showing "raw SA" here, which while it's the most verbose is also the most explicit and flexible, and is also the style everyone knows):

u1 = User()
u1.name = 'bob'

o1 = Order()
o1.description = 'order 1'

a1 = u1.create_address()
a1.street = '123 anywhere street'
a2 = u1.create_address()
a2.street = '345 orchard ave'

a3 = o1.create_address()
a3.street = '444 park ave.'

sess = create_session()
sess.save(u1)
sess.save(o1)
sess.flush()

sess.clear()

# query objects, get their addresses

bob = sess.query(User).get_by(name='bob')
assert [s.street for s in bob.addresses] == ['123 anywhere street', '345 orchard ave']

order = sess.query(Order).get_by(description='order 1')
assert order.address.street == '444 park ave.'

# query from Address to members

for address in sess.query(Address).list():
    print "Street", address.street, "Member", address.member

Running the application without SQL logging gives us the output:

Street 123 anywhere street Member <__main__.User object at 0x11e3750>
Street 345 orchard ave Member <__main__.User object at 0x11e3750>
Street 444 park ave. Member <__main__.Order object at 0x11e3b10>

And so ends Part I: how to mimic AR's exact method of "polymorphic association". It still feels a little empty knowing that it violates normalization. If that doesn't bother you, take my word for it...you won't impress Yahoo's development team with this schema. Which brings us to the lightning round:

Part II. How to Really Do This.

It shouldn't be too hard to generalize the addressable function we created above into a pluggable module providing the polymorphic-association-feature-a-la-ActiveRecord. Before we do that, let's fix the foreign key constraint issue. The AR article has a comment suggesting, "Well, if you really need a constraint, put a trigger". Which, again, is a little naive as to the full functionality of a foreign key (CASCADE, anyone?). In our case, we are going to "flip around" some aspects of the relationship such that the addresses table will be a totally normal table, and each source table will instead have a foreign key column referencing the association.

How will we do that, and still allow one-to-many relationships ? By doing what the name "association" implies - adding an association table. This new table will express part of what we're currently trying to express between just two tables.

For those following at home, the script re-begins right underneath the BoundMetaData declaration, with all new tables. First, the association table:

address_associations = Table("address_associations", metadata,
    Column('assoc_id', Integer, primary_key=True),
    Column('type', String(50), nullable=False)
)

In this table, the association itself becomes an entity of its own with its own primary key, as well as a type column which will serve a purpose similar to, but less intrusive than, the previous addressable_type column. For the one-to-one and one-to-many use cases we are currently addressing, the addresses table can be expressed as one-to-many from this association table (as before, a many-to-many association would need yet another table, which we won't go into in this article):

addresses = Table("addresses", metadata,
    Column('id', Integer, primary_key=True),
    Column('assoc_id', None, ForeignKey('address_associations.assoc_id')),
    Column('street', String(100)),
    Column('city', String(50)),
    Column('country', String(50))
    )

Back into familiar territory, where a table relation can have a foreign key. Note the lesser-known trick we are using of defining the type as None for a column which has a foreign key; the resulting type will be taken from the referenced column. Now, when defining the orders and users tables, they will take on the burden of expressing relationships to this association, by having the foreign key on their side:

users = Table("users", metadata,
    Column('id', Integer, primary_key=True),
    Column('name', String(50), nullable=False),
    Column('assoc_id', None, ForeignKey('address_associations.assoc_id'))
    )

orders = Table("orders", metadata,
    Column('id', Integer, primary_key=True),
    Column('description', String(50), nullable=False),
    Column('assoc_id', None, ForeignKey('address_associations.assoc_id'))
    )

This puts the "polymorphism" where it should be: on the part of the schema that is actually polymorphic. Let's see what our sample data looks like with this scheme:

users:                  user_id   name   assoc_id
                       ----------------------------
                         1        bob        1
                         2        ed         2

orders:                order_id    description   assoc_id
                       ------------------------------------
                         1          order #1         3
                         5          order #2        NULL

address_associations:  assoc_id   type
                       ------------------
                          1       'user'
                          2       'user'
                          3       'order'

addresses:               id   assoc_id   street
                       -----------------------------------------
                          1       1    123 anywhere street
                          2       1    345 orchard ave
                          3       2    27 fulton st.
                          4       3    444 park ave.

So the above data shows that the addresses table loses its previously complex role. As an additional advantage, the limitation that all source tables (i.e. the users and orders tables) need to have a single-column surrogate primary key goes away as well (see how it all falls together when you follow the rules?).

Keen users might notice another interesting facet of this schema: the relation from address_associations to the users and orders tables is identical to a joined table inheritance pattern, where users and orders both inherit from address_associations (save for the NULL value for order #2, for which we could have a placeholder row in address_associations). Suddenly, the original premise that AR's "polymorphic associations" are nothing like SA's "polymorphic inheritance" starts to melt away. It's not that they're so different; it's just that the way AR does it makes it look less like the traditional inheritance pattern.

In fact, this table scheme is exactly joined table inheritance, and we could certainly go the "straight" route of creating an Addressable base class and mapper from which the User and Order classes/mappers derive, then creating the traditional SA "polymorphic mapping" using UNION ALL (or whatever surprise 0.4 has in store) with a relationship to Address. Here, though, we're going to do it differently: we still want to look at this association as a cross-cutting concern rather than an "is-a" relationship, and besides, SA's explicit inheritance features only support single inheritance, and we'd rather not occupy the "inherits" slot with a relationship that is at best a mixin, not an "is-a".
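For the record, here's a very rough sketch of what that "straight" joined table inheritance mapping might look like against these same tables; Addressable here is a hypothetical base class, the polymorphic union/select_table details are glossed over, and this is not the route we take below:

class Addressable(object):
    pass

mapper(Addressable, address_associations,
    polymorphic_on=address_associations.c.type,
    properties={'addresses': relation(Address, backref='association')})

mapper(User, users, inherits=Addressable, polymorphic_identity='users')
mapper(Order, orders, inherits=Addressable, polymorphic_identity='orders')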

So let's define a new class AddressAssoc which we can use to map to the address_associations table, and a new addressable function to replace the old one. This function applies a little bit more Python property magic in order to create a collection- or scalar-based property on the source class which completely conceals the usage of AddressAssoc:

class AddressAssoc(object):
    def __init__(self, name):
        self.type = name

def addressable(cls, name, uselist=True):
    """addressable 'interface'."""

    mapper = class_mapper(cls)
    table = mapper.local_table
    mapper.add_property('address_rel', relation(
        AddressAssoc,
        backref=backref('_backref_%s' % table.name, uselist=False)))

    if uselist:
        # list based property decorator
        def get(self):
            if self.address_rel is None:
                self.address_rel = AddressAssoc(table.name)
            return self.address_rel.addresses
        setattr(cls, name, property(get))
    else:
        # scalar based property decorator
        def get(self):
            return self.address_rel.addresses[0]
        def set(self, value):
            if self.address_rel is None:
                self.address_rel = AddressAssoc(table.name)
            self.address_rel.addresses = [value]
        setattr(cls, name, property(get, set))

The basic idea of the relationship is exactly the same as before; we will be adding distinct backrefs to the AddressAssoc object for each kind of relation configured. But observe that now that we're back in normalized constraint-land, the relationship is totally simple and requires no explicit joins or ad-hoc foreign key arguments. The "database" part of this operation is now just one line of code, and the rest is just some property-based "make-it-pretty" code to make usage more convenient. While the property code is a little verbose, overall the whole thing is conceptually simpler.

The member property on Address looks just about the same as before, save that it pulls the backref off of its association object rather than off of itself; it's defined below along with the mappers for Address and AddressAssoc:

class Address(object):
    member = property(lambda self: getattr(self.association, '_backref_%s' % self.association.type))

mapper(Address, addresses)

mapper(AddressAssoc, address_associations, properties={
    'addresses':relation(Address, backref='association'),
})

The Address object is now freely creatable with no weird create_address() function needed (although we probably could have found a way to do away with it in the first example too).

Let's run the same test code, minus the usage of create_address(), which of course we could add back in if we wanted it that way. Otherwise, the test is exactly the same, and the results are the same:

class User(object):
    pass

mapper(User, users)
addressable(User, 'addresses', uselist=True)

class Order(object):
    pass

mapper(Order, orders)
addressable(Order, 'address', uselist=False)

metadata.create_all()

u1 = User()
u1.name = 'bob'

o1 = Order()
o1.description = 'order 1'

a1 = Address()
u1.addresses.append(a1)
a1.street = '123 anywhere street'

a2 = Address()
u1.addresses.append(a2)
a2.street = '345 orchard ave'

o1.address = Address()
o1.address.street = '444 park ave.'

sess = create_session()
sess.save(u1)
sess.save(o1)
sess.flush()

sess.clear()

# query objects, get their addresses

bob = sess.query(User).get_by(name='bob')
assert [s.street for s in bob.addresses] == ['123 anywhere street', '345 orchard ave']

order = sess.query(Order).get_by(description='order 1')
assert order.address.street == '444 park ave.'

# query from Address to members

for address in sess.query(Address).list():
    print "Street", address.street, "Member", address.member

Output is again:

Street 123 anywhere street Member <__main__.User object at 0x11fb050>
Street 345 orchard ave Member <__main__.User object at 0x11fb050>
Street 444 park ave. Member <__main__.Order object at 0x11f3110>

And there you have it, polymorphic associations, SQLAlchemy style.

Both of these examples, as well as a third version that "genericizes" the addressable function, are checked into the /examples directory of SQLAlchemy.

poly_assoc_1.py - AR version.

poly_assoc_2.py - Normalized version.

poly_assoc_3.py - Generic version.

discriminator_on_association.py - New in 0.7 ! Modernized declarative form of polymorphic association.