Selecting Booleans

September 09, 2008 at 01:29 PM | Code, SQLAlchemy

Ticket 798, "allow conjunctions to act as column elements", is complete after a long wait, finally attended to for the very good reason that I suddenly needed the feature myself ;). A few tweaks later, we can now talk about asking yes/no questions of our database.

As we head into the 0.5 era of SQLAlchemy, one theme is that if we're using the ORM, we're going to need select() constructs a lot less, if at all; most things can now be done using Query objects. The other day, I needed to ask my database whether some particular data was available, and that's it. The query needed to be as fast as possible, so I wanted to use EXISTS, which lets the database return an answer as soon as it locates the first matching row. Additionally, I needed to ask this question for a few different sets of criteria, which could be achieved most efficiently by combining them into a single statement using OR.

Using the current trunk of SQLAlchemy 0.5, we can use Query() against boolean values:

from sqlalchemy import *
from sqlalchemy.orm import *
engine = create_engine('sqlite://', echo=True)
Session = scoped_session(sessionmaker(bind=engine))

true, false = literal(True), literal(False)

(ret, ), = Session.query(true)
print ret

which returns:

True

The question we just asked our SQLite database is SELECT 1, where "1" is SQLite's way of representing true (in Postgres it's true; in MySQL it can be 1 or true; using literal(True) means we don't have to worry about this). SQLite tells us, "yes, True is True". Note the way I'm getting the row out of Query. At the moment that seems like a fun way to go, but you could also say:

ret, = Session.query(true).one()

and of course:

ret = Session.query(true).one()[0]
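
For comparison, here's a sketch of the same question asked with a plain select() construct, bypassing the ORM entirely (this reuses the engine created above; the ResultProxy's scalar() method returns the first column of the first row):

ret = engine.execute(select([literal(True)])).scalar()
print ret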

We can also ask our database about boolean values combined with OR and AND:

>>> (ret, ), = Session.query(or_(true, false))
>>> ret
True

>>> (ret, ), = Session.query(and_(true, false))
>>> ret
False
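
Negation presumably works the same way; a sketch, assuming not_() is accepted as a column element just like the conjunctions are:

>>> (ret, ), = Session.query(not_(true))
>>> ret
False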

To demonstrate EXISTS, let's build a table. We're using declarative, which is pretty much the only way I roll these days (hey, better late than never....):

from sqlalchemy.ext.declarative import declarative_base
Base = declarative_base()
class Keyword(Base):
    __tablename__ = "keyword"
    id = Column(Integer, primary_key=True)
    name = Column(String(255))

Base.metadata.create_all(engine)

One way we could find out whether our keyword table has any of a certain set of keywords is to do something familiar and ask for a count:

keywords = ["beans", "lentils", "legumes"]
(ret, ), = Session.query(func.count(Keyword.id)).filter(Keyword.name.in_(keywords))

But to eliminate the need for the database to actually count the full set of rows, we can ask it just whether any rows at all exist, using the exists() construct:

(ret, ), = Session.query(exists().where(Keyword.name.in_(keywords)))

Which issues the SQL:

SELECT EXISTS (SELECT * FROM keyword WHERE keyword.name IN (?, ?, ?)) AS anon_1

The return value is a boolean: True if rows were found, False if not. In my case, I was querying across a bunch of tables. Let's suppose we have a database which represents cookbooks and individual recipes, both of which have lists of keywords associated with them. Let's declare that:

recipe_keywords = Table("recipe_keyword", Base.metadata,
                    Column("recipe_id", Integer, ForeignKey("recipe.id")),
                    Column("keyword_id", Integer, ForeignKey("keyword.id"))
                    )

cookbook_keywords = Table("cookbook_keyword", Base.metadata,
                    Column("cookbook_id", Integer, ForeignKey("cookbook.id")),
                    Column("keyword_id", Integer, ForeignKey("keyword.id"))
                    )

class Recipe(Base):
    __tablename__ = "recipe"
    id = Column(Integer, primary_key=True)
    description = Column(Text)
    keywords = relation(Keyword, secondary=recipe_keywords)

class Cookbook(Base):
    __tablename__ = "cookbook"
    id = Column(Integer, primary_key=True)
    description = Column(Text)
    keywords = relation(Keyword, secondary=cookbook_keywords)

Base.metadata.create_all(engine)

The above schema defines a recipe and a cookbook table, each of which relates to keyword via the recipe_keyword or cookbook_keyword association table, respectively.

The question of "Do any recipes feature any of our above three keywords?" can be answered by:

(ret, ), = Session.query(
    exists().where(Recipe.id==recipe_keywords.c.recipe_id).
            where(recipe_keywords.c.keyword_id==Keyword.id).
            where(Keyword.name.in_(keywords))
)

Above we ask "do any rows exist, where the ID of the recipe table matches the recipe ID of the recipe_keywords table, and the keyword ID of the recipe_keywords table matches the ID of a keyword table row, and the name of the keyword is in the list ["beans", "lentils", "legumes"]?". The SQL is:

SELECT EXISTS (SELECT *
FROM recipe, recipe_keyword, keyword
WHERE (recipe.id = recipe_keyword.recipe_id AND recipe_keyword.keyword_id = keyword.id)
AND keyword.name IN (?, ?, ?)) AS anon_1

Spelling out the full join from Recipe to Keyword above is a little bit verbose. We can instead let SQLAlchemy create it for us from the Recipe.keywords relation already set up, via the ORM-level join() function, which provides direct access to the ORM's join algorithm:

from sqlalchemy.orm import join
(ret, ), = Session.query(
        exists().select_from(join(Recipe, Keyword, Recipe.keywords)).
                where(Keyword.name.in_(keywords))
)

This generates:

SELECT EXISTS (SELECT *
FROM recipe JOIN recipe_keyword AS recipe_keyword_1 ON
recipe.id = recipe_keyword_1.recipe_id JOIN keyword ON
keyword.id = recipe_keyword_1.keyword_id
WHERE keyword.name IN (?, ?, ?)) AS anon_1

We now know how to ask the database if it has any recipe rows which relate to a given set of keyword names. To ask the database if it has any cookbook rows or recipe rows matching our keywords at the same time, we can double up on exists() clauses using or_():

(ret, ), = Session.query(or_(
        exists().select_from(join(Recipe, Keyword, Recipe.keywords)).
                where(Keyword.name.in_(keywords)),

        exists().select_from(join(Cookbook, Keyword, Cookbook.keywords)).
            where(Keyword.name.in_(keywords))
))

At this point, our eyes begin to glaze over when viewing the SQL itself. But that's fine; just remember that whenever this happens, somewhere in the world another "ORMs aren't worth it" blog post fades away:

SELECT ((EXISTS (SELECT *
FROM recipe JOIN recipe_keyword AS recipe_keyword_1 ON recipe.id = recipe_keyword_1.recipe_id
JOIN keyword ON keyword.id = recipe_keyword_1.keyword_id
WHERE keyword.name IN (?, ?, ?))) OR (EXISTS (SELECT *
FROM cookbook JOIN cookbook_keyword AS cookbook_keyword_1 ON
cookbook.id = cookbook_keyword_1.cookbook_id JOIN keyword ON
keyword.id = cookbook_keyword_1.keyword_id
WHERE keyword.name IN (?, ?, ?)))) AS anon_1

Those readers who are familiar with some of SQLA's advanced Query operators might recognize that the above EXISTS queries look a lot like an any() expression. That's because they are! Well, almost. As a review, an any() expression creates an EXISTS clause which is correlated with the enclosing query, such as:

rows = Session.query(Recipe).filter(Recipe.keywords.any(Keyword.name.in_(keywords))).all()

This query produces a regular SELECT from the recipe table, then embeds a correlated EXISTS inside the WHERE clause:

SELECT recipe.id AS recipe_id, recipe.description AS recipe_description
FROM recipe
WHERE EXISTS (SELECT 1
FROM recipe_keyword, keyword
WHERE (recipe.id = recipe_keyword.recipe_id AND keyword.id = recipe_keyword.keyword_id) AND
keyword.name IN (?, ?, ?))

When any() is used, it explicitly states that the exists() clause should correlate() to the recipe table, which allows it to work in scenarios where SQLAlchemy's usual "auto" correlation cannot make the right decision (we have tests which illustrate this). So to use any() in our "column-based" approach, we just need to turn off that correlation using correlate(None). Our "recipes/cookbooks which exist" query can be stated as:

(ret, ), = Session.query(or_(
    Recipe.keywords.any(Keyword.name.in_(keywords)).correlate(None),
    Cookbook.keywords.any(Keyword.name.in_(keywords)).correlate(None)
    ))

While the any() approach above is nice, the correlation part of it has me preferring the more explicit exists() version.


SQLAlchemy code swarm

June 25, 2008 at 01:38 PM | Code, SQLAlchemy

Django's already got one, and Brian Rosner made one for us too: link. SQLA's pre-release development is apparent here, which is why the swarm floats around "zzzeek" for so long.


Reddit.com Goes Open Source

June 18, 2008 at 10:07 AM | Code, Mako/Pylons, SQLAlchemy

Reddit has opened up their source, and we can now see just what they've been up to. It's been known for some time that Reddit was (re)built using Pylons and Mako templates, contrary to their FAQ, which still states that they use web.py. As it turns out, they've also built something of their own database layer, which seems to include a homegrown caching layer and is ultimately built on top of SQLAlchemy, using the SQLA expression language to generate queries. Connections are served with the QueuePool, and they use the threadlocal setting so that they can get implicit access to transactions in progress. They vertically partition their database access among four separate engines, one for each of four distinct areas of functionality on the site.
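
Here's a hypothetical sketch of that kind of setup; the partition names, URLs and table are invented for illustration, but strategy='threadlocal' is the actual create_engine() option in question:

from sqlalchemy import create_engine

# one engine per functional area; each uses the "threadlocal" strategy,
# so code anywhere in the current thread can join a transaction in progress
engines = {
    'links':    create_engine('postgres://db1/reddit', strategy='threadlocal'),
    'comments': create_engine('postgres://db2/reddit', strategy='threadlocal'),
    'votes':    create_engine('postgres://db3/reddit', strategy='threadlocal'),
    'users':    create_engine('postgres://db4/reddit', strategy='threadlocal'),
}

# begin()/commit() on the engine itself scopes a transaction to the
# current thread; execute() calls in between participate implicitly
engine = engines['links']
engine.begin()
try:
    engine.execute("UPDATE link SET ups = ups + 1 WHERE id = 1")
    engine.commit()
except:
    engine.rollback()
    raise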

This is currently the highest volume website I'm aware of using SQLAlchemy and Pylons, and is a testament to the stability of our core components (I hope). Python in general is not too prominent in New York City where I work; Java, PHP and .NET are still the default "go-to" platforms, and most developers here look at you kind of funny when you mention Python. Note how well-known Java advocate Ted Neward says "even Python!", as though we're the most fringe Java alternative imaginable. I hope examples like Reddit continue to illustrate that Python presents the best mix of performance, stability, and rapid development for web development today, not to mention one of the broadest software ecosystems in the field (which I've always maintained is a good thing).


Pycon 08 Wrapup

March 21, 2008 at 01:48 PM | Code, SQLAlchemy, Talks

Jason and I spent a full eight days in Chicago this year, with a big slate of activities. While we didn't commit to SQLAlchemy sprints ahead of time (and were therefore not officially rostered), we sprinted on SA the entire time, picked up two or three folks to sprint with us, and helped with some "parallel" SQLAlchemy-related sprints; next year we'll definitely plan ahead of time so that people can get involved sooner and more explicitly.

Pycon started off for SA with a big full day of tutorials, with the work split up among myself, Jason, and Jonathan Ellis. Jonathan did his beginner tutorial; Jason and I did the advanced one. The huge winner for this tutorial was the slide runner we came up with ("we" meaning: I wrote a little 20-line header to page through a Python script one chunk at a time, and Jason ran 100 miles further with the idea, turning it into a full-blown interactive Python prompt). Using the runner, everyone in the room could keep hitting "enter", and each chunk of code on the projector would come out on their screen and execute itself, leaving them at a Python prompt where they could further explore the constructs that were just created. Some people just sat back and watched the code go by; others dug in and answered our exercise questions with it.
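
A minimal sketch of the runner idea (not the actual tool; the chunk delimiter is invented): split the script on a marker, echo and execute one chunk per keypress, then leave the user at an interactive prompt containing everything that was built up:

import code, sys

namespace = {}
# chunks are separated by a "### slide" marker (invented for this sketch)
chunks = open(sys.argv[1]).read().split('### slide')
for chunk in chunks:
    raw_input('-- hit enter to run the next chunk --')
    print chunk
    exec chunk in namespace
# drop into a prompt with all the constructs just created
code.interact(local=namespace)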

Next up was my "SQLAlchemy 0.4 and Beyond" talk where my goal was to bring people up to speed on where we're at right now, including where we came from, some of the problems we had, and what we've done about them, then segueing into some deeper examples from the current release and into some of the more interesting projects that are underway.

On to the sprints. By the time they started, we had already met with various people and attended a few BOFs as well. Here's a rundown of everything going on:

  • Jython - Frank Wierzbicki, recently hired by Sun to work full time on Jython, had presented his Jython integration of SQLAlchemy. Frank and I could not share the same goals more closely; both coming from an "enterprisey" Java background, we are well aware that there's a vast world of Java developers who would flock to Python if a Java-compatible yet Pythonic toolset were available, particularly including an ORM with features comparable to Hibernate. Jason and Frank met up during the sprints to align their architectural paths, as Jason has been working very hard on a full-blown multi-dialect architecture, which will allow the maximum amount of reuse of dialects among any number of DBAPI connectors. An example is to use the same MySQL compiler with MySQLDB, JDBC, and pyODBC. Stub modules which account for the idiosyncrasies of each DBAPI would draw upon the central MySQL functionality. So with a little more effort that's underway, SQLAlchemy should work fully out of the box on Jython trunk in the very near future.
  • Django - yup, a lot went on with Django and SQLAlchemy this year. We met with a whole room full of Djangoers who've expressed the ongoing desire for Django to have a tighter integration layer with SQLAlchemy available as an option. Michael Trier spearheaded the effort, showing us some of the work he and his team had already completed with their own django-sqlalchemy layer. We sat down and shared a lot of notes and ideas, including using the new SQLAlchemy "Declarative" plugin as a guide for developing their own solution, as well as some strategies for best integrating SQLAlchemy connection/transaction management with Django's own transaction middleware. We had an initial mini-sprint during an earlier BOF and then we communicated throughout the main sprints. Jason has hinted he might give Django a try on an upcoming project which would give us a great chance to give it a test spin and fine tune it.
  • Zope - I was thrilled to spend pretty much the entire sprint alongside Christian Theune, a very experienced developer from whom I sought to learn as much as possible. His company already uses a SQLAlchemy/Zope integrated application, and he was very interested in the new custom instrumentation branch, which was started to enable Phillip Eby to integrate SQLAlchemy with Trellis. Christian worked through the branch, which presents an entirely open-ended way to redefine how SQLAlchemy classes are instrumented (or not), so that all access to the mapped class and its instances, including SQLA's own private attributes, may be mediated through external code; this allows his application to use Zope security proxies to broker access to all end-user instances and classes. At the same time, Christian is also interested in full session rollback capability, something we were already starting to implement, so we're also finally getting deep into letting SA roll back the internal state of its Session in a clean way, such that rollbacks within transactions or SAVEPOINTs will leave you with a session that remains fully usable without needing to expire its contents. The comprehensive rollback feature will appear as an option sometime within the 0.4 series and be on by default in 0.5.
  • Pylons/Turbogears - This is of course our main stomping ground. SQLA sprints took place in the same room as the Pylons/TG sprint, and much beer and many margaritas were consumed. Chris Perkins of DBSprockets fame has signed on with me and Mark Ramm to make some progress on the Prentice Hall SQLAlchemy book... but for the impatient, there will already be an O'Reilly book available in June.

All in all Pycon was fantastic (I missed the controversial lightning talks), the crowd was great (and much bigger), and I got to meet a whole lot of fans, particularly Chris McAvoy, who's probably the most old school of them all; he was one of the very first Myghty users. Don't weep for Myghty, though: if you use Pylons with Mako, both are its direct descendants.


Expression Transformations

January 23, 2008 at 11:12 PM | Code, SQLAlchemy

While the ORM side of SQLAlchemy is what most of our users focus on, SA has always intended to provide a broader set of tools than just that; referring to it as an "ORM" is like referring to the internet as "the web". Philosophically, we aim to approach issues in a broad, "no-shortcuts" way that takes a long time to perfect, but once it starts hitting the "sweet spot", a lot of capability falls out of it for free.

The expression language and the ORM's way of using it is one such example; the ORM considers everything in terms of SQL expressions which are all ultimately user-defined, and attempts to be as unconstrained as possible while at the same time being as informative as possible about what constraints are unavoidable. To support this, the expression language supports "transformative" behavior which allows an application, not just SA's ORM but any app, to deal with SQL generation in the abstract. That is, not just abstracted from the particular syntax of a particular database, but abstracted from the structure of the expressions themselves.

This is clearly a work in progress with caveats galore, but with each major version of SA the main ideas have become more refined and clear in their intent, starting with some murky implementations early on that could only be used in extremely specific operations, and later evolving into core concepts from which most of the ORM's querying ability is built. Here I want to introduce three concepts that are now commonplace throughout the SA core.

Column Correspondence

This is the ability to relate the columns in one SQL expression to those of another. Consider these two expressions; each ultimately selects from the same table and returns the same results:

SELECT id, name FROM usertable

SELECT ua.uid, ua.uname FROM (SELECT id AS uid, name AS uname FROM usertable) AS ua

When executed, the cursor.description of each statement contains the names of the result columns. The first statement will have the names id and name, whereas the second will have the names uid and uname. Column correspondence allows us to match id to uid and name to uname. Starting with the table definition and expressions for these two selects:

>>> from sqlalchemy import *
>>> meta = MetaData()
>>> users = Table('usertable', meta,
...     Column('id', Integer, primary_key=True),
...     Column('name', String(50)),
...     )

>>> s1 = select([users.c.id.label('uid'), users.c.name.label('uname')])
>>> ua = s1.alias('ua')
>>> s2 = select([ua.c.uid, ua.c.uname])

The columns in the two selects can be related to the base table as well as each other using corresponding_column():

>>> print users.corresponding_column(s2.c.uid)
usertable.id
>>> print s2.corresponding_column(s1.c.uname)
uname

With a so-called "composite" expression that contains multiple SELECT statements, the behavior is a little more intricate, as in the UNION below which takes two different tables and melds them together:

SELECT id AS uid, name AS uname FROM (
    SELECT id, name FROM usertable WHERE name='jack'

    UNION ALL

    SELECT id, email_address FROM addresses WHERE id=12
)

The representation in expression form, adding in the addresses table to start:

>>> addresses = Table('addresses', meta,
...     Column('id', Integer, primary_key=True),
...     Column('email_address', String(100)),
...     Column('user_id', Integer, ForeignKey('usertable.id')))

>>> s1 = users.select().where(users.c.name=='jack')
>>> s2 = select([addresses.c.id, addresses.c.email_address]).where(addresses.c.id==12)
>>> u = union_all(s1, s2)
>>> s3 = select([u.c.id.label('uid'), u.c.name.label('uname')])

Now, what column does s3.c.uid correspond to? In the above union, it's derived from both users.c.id and addresses.c.id. All the paths you'd expect work:

>>> print addresses.corresponding_column(s3.c.uid)
addresses.id
>>> print users.corresponding_column(s3.c.uid)
usertable.id
>>> print addresses.corresponding_column(s3.c.uname)
addresses.email_address
>>> print s3.corresponding_column(users.c.id)
uid

The corresponding_column() method makes use of a set associated with every column called proxy_set. Whenever a column is derived from another, it's given a collection called proxies which contains the originating column, or in the case of a union, all originators. proxy_set is then created on demand, and is the union of the column itself with the proxy_set of each column in its proxies collection, evaluated recursively up to the original columns; basically a linked list. To see if the paths of two columns c1 and c2 overlap, we check whether the intersection of c1.proxy_set and c2.proxy_set is non-empty. Multiple hits can be resolved by taking the largest intersection.
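
A minimal sketch of that idea (not SQLAlchemy's actual implementation, just its shape):

def proxy_set(column):
    # the column itself, plus everything it derives from, recursively
    s = set([column])
    for originator in column.proxies:
        s.update(proxy_set(originator))
    return s

def corresponds(c1, c2):
    # a shared lineage means a non-empty intersection
    return len(proxy_set(c1) & proxy_set(c2)) > 0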

Column correspondence is used all over the place in SQLAlchemy. One way it's used is to adapt result rows from one selectable to another. Suppose we plug into a test database and create the standard "adjacency list" table nodes:

>>> meta.bind = create_engine('sqlite://')
>>> nodes = Table('nodes', meta,
...     Column('node_id', Integer, primary_key=True),
...     Column('parent_id', Integer, ForeignKey('nodes.node_id')),
...     Column('data', String(50)))
>>> nodes.create()

Then, we want to load all nodes, as well as their children and grandchildren, using outer joins. Joining a table to itself requires aliases, as in this SQL:

SELECT nodes.node_id, nodes.parent_id, nodes.data,
       n2.node_id, n2.parent_id, n2.data,
       n3.node_id, n3.parent_id, n3.data
FROM
       nodes
       LEFT OUTER JOIN nodes AS n2 ON nodes.node_id=n2.parent_id
       LEFT OUTER JOIN nodes AS n3 ON n2.node_id=n3.parent_id

Then suppose we want a function that can print out the attributes of a "node" for us, given a result row. If we were just selecting three columns, we could do this:

>>> def print_node(row):
...    print "Id:", row[0], "Parent id:", row[1], "Data:", row[2]

We can see that approach won't work for our query above, as we need to locate three sets of columns at different positions. While we could try passing in numerical clues as to where the columns we want are, or rely upon naming schemes, SQLAlchemy most cleanly abstracts away column targeting by allowing you to use the actual Column objects which comprise the expression's "columns" clause. Building our query as an expression:

>>> n2 = nodes.alias('n2')
>>> n3 = nodes.alias('n3')
>>> all_nodes = nodes.outerjoin(n2, nodes.c.node_id==n2.c.parent_id).\
...     outerjoin(n3, n2.c.node_id==n3.c.parent_id).select(use_labels=True)

The use_labels flag sets up unique names for each column, which is a current "implementation detail"; SQLAlchemy ultimately targets columns based on string names, so they need to be unique. While positional targeting would remove this detail, concerns over free-text expressions which might throw off the column count have so far prevented this switch.
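
A sketch of what to expect from use_labels: each column in the statement's columns clause is named by its table or alias name plus the column name, something like:

>>> print all_nodes.c.keys()
['nodes_node_id', 'nodes_parent_id', 'nodes_data', 'n2_node_id',
 'n2_parent_id', 'n2_data', 'n3_node_id', 'n3_parent_id', 'n3_data']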

We would want to define our print_node as below, referencing only the nodes table. The function has no awareness of the n2 and n3 alias constructs:

>>> def print_node(row):
...    print "Id:", row[nodes.c.node_id], \
...    "Parent id:", row[nodes.c.parent_id], \
...    "Data:", row[nodes.c.data]

To differentiate among columns that correspond to nodes, n2, or n3, SQLAlchemy uses a function built on top of corresponding_column called row_adapter, which returns a callable that can translate a row from one selectable to another. So usage would look like this:

>>> from sqlalchemy.sql.util import row_adapter
>>> n2_adapter = row_adapter(n2, nodes)
>>> n3_adapter = row_adapter(n3, nodes)
>>> for row in all_nodes.execute():
...    print_node(row)
...    print_node(n2_adapter(row))
...    print_node(n3_adapter(row))

print_node is used three times with the same row, each time given a different interpretation of that row provided by the adapter. What row_adapter() does here is provide a decoupling layer between code which knows how a selectable was constructed and code which consumes result rows from that selectable.
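
Conceptually, and this is just a sketch rather than the real implementation, the adapter wraps each row so that lookups against one selectable's columns are translated, via corresponding_column(), into lookups against another's:

def row_adapter(from_, to_):
    # map each column of "to_" (what consumers index with) to its
    # corresponding column in "from_" (what's actually in the row)
    mapping = dict((col, from_.corresponding_column(col)) for col in to_.columns)
    def adapt(row):
        class AdaptedRow(object):
            def __getitem__(self, col):
                return row[mapping.get(col, col)]
        return AdaptedRow()
    return adapt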

A wackier example. Suppose we have a regular mapping between User and Address, with a one-to-many User.addresses backed by a many-to-one Address.user. This is the same example as in the SQLAlchemy ORM tutorial and dozens of unit tests:

>>> from sqlalchemy.orm import *
>>> class User(object):pass
>>> class Address(object):pass
>>> mapper(User, users, properties={'addresses':relation(Address, backref='user')})
>>> mapper(Address, addresses)

So the DBAs send you a large UNION query that returns one address record per user id represented in the addresses table: either the address row that has the domain "@blooger.biz", or, if no such address exists for that user, the address row with the highest primary key value. The query below accomplishes this by first selecting all the "blooger.biz" rows, then unioning that result with the "max id" result, which explicitly omits the "blooger.biz" rows:

SELECT addresses.id, addresses.user_id, addresses.email_address
FROM addresses WHERE
addresses.email_address LIKE '%@blooger.biz'

UNION ALL

SELECT addresses.id, addresses.user_id, addresses.email_address
FROM addresses JOIN
(SELECT MAX(addresses.id) AS maxid
  FROM addresses GROUP BY addresses.user_id) AS max_ids
ON addresses.id=max_ids.maxid
WHERE NOT EXISTS(SELECT 1 FROM addresses AS a2 WHERE
       addresses.user_id=a2.user_id AND
       a2.email_address LIKE '%@blooger.biz')

How do we "drop" this select statement into an ORM query with a minimum of headache? We could certainly work from the string representation above directly: just say query.from_statement(<text>) and be done with it. But we're going to want to do more with this query in the next section, where the Query will be able to intelligently manipulate the expression further, so for that we build a SQL construct out of it. We build such an expression the same way any DBA would design the above SQL directly, working from the inside out:

>>> max_ids = select([func.max(addresses.c.id).label('maxid')]).\
...       group_by(addresses.c.user_id).alias('max_ids')
>>> aalias = addresses.alias()
>>> aview = union_all(
...  addresses.select().where(addresses.c.email_address.endswith('@blooger.biz')),
...  addresses.select().\
...    select_from(addresses.join(max_ids, addresses.c.id==max_ids.c.maxid)).\
...    where(
...    ~exists([1], and_(
...        addresses.c.user_id==aalias.c.user_id,
...        aalias.c.email_address.endswith('@blooger.biz')))
...    )
... )

With our statement built, we can drop it into the select_from() method on Query, which is a little bit like from_statement(), except that the Query remains ready to manipulate the incoming expression as though it were the primary mapped table:

>>> sess = create_session()
>>> sess.query(Address).select_from(aview)

When the above query fetches rows, the "row adaption" we discussed in the last section is applied to each incoming row, mapping the aview selectable back to the addresses table. This way, our Address mapper, which still only knows about the addresses table, can target columns against the addresses table even when they are in fact adapted from columns targeted against the aliased aview selectable. In a concrete sense, it means that result columns which are labeled (or, if we were doing positional column targeting, positioned) corresponding to Column objects present in the aview construct can be referenced with Column objects from the addresses table. When joins and eager loads are added to this query, it's very likely that the same row will be translated using many row adapters at result time, often translating different columns into the same underlying table, as is the case in our nodes example.

Clause Adaption

Clause adaption builds upon the idea of corresponding_column() to replace parts of an expression which "correspond" to another expression with their counterparts from that other expression. Originally this operation supported columns only and worked only with simplistic column-oriented expressions, but in recent months it's been expanded to support the replacement of pretty much any kind of expression, and has moved from an outlying edge case to a core concept. It provides the ability to query from "polymorphic" selectables (which select for several different mappers encapsulated in a subquery), to create aliased joins, and to allow joins to or from any arbitrary selectable, all generated strictly from the known relationships between the "underlying" tables.

Using our nodes example, we can take filtering criterion constructed from the nodes table object and apply it to aliases of the nodes table. For example, to load all nodes which have a grandchild whose data field is "crackers", the desired SQL is:

SELECT nodes.node_id, nodes.parent_id, nodes.data
FROM nodes JOIN nodes AS children ON nodes.node_id=children.parent_id
JOIN nodes AS grandchildren ON children.node_id=grandchildren.parent_id
WHERE grandchildren.data='crackers'

If we were given the SQL construct:

>>> children = nodes.alias('children')
>>> grandchildren = nodes.alias('grandchildren')
>>> n = nodes.select().\
...     select_from(nodes.join(children, nodes.c.node_id==children.c.parent_id).\
...     join(grandchildren, children.c.node_id==grandchildren.c.parent_id))

we can apply the criterion nodes.c.data=='crackers', constructed without knowledge of the grandchildren alias, to that alias using adaption:

>>> from sqlalchemy.sql.util import ClauseAdapter
>>> criterion = nodes.c.data=='crackers'

>>> print criterion
nodes.data = :nodes_data

>>> adapted = ClauseAdapter(grandchildren).traverse(criterion)
>>> print adapted
grandchildren.data = :nodes_data

>>> print n.where(adapted)
SELECT nodes.node_id, nodes.parent_id, nodes.data
FROM nodes JOIN nodes AS children ON nodes.node_id = children.parent_id
JOIN nodes AS grandchildren ON children.node_id = grandchildren.parent_id
WHERE grandchildren.data = ?

Above, the ClauseAdapter object is created against the grandchildren alias. When it's given a clause to traverse(), it produces a copy of that clause with all elements that "correspond" to grandchildren replaced with the corresponding element; in this case, the nodes.data column is replaced with grandchildren.data.

The above process is the same thing that happens automatically if you have a mapping set up for nodes and use aliased joins as follows:

session.query(Node).join('children', 'children', aliased=True).\
   filter(Node.data=='crackers')

Let's take a look at how adaption applies to our users and addresses example. Those familiar with SQLAlchemy relations know that the "join condition" between the users and addresses tables is users.c.id==addresses.c.user_id. When we work with these mappers, anytime SA joins between a User and an Address entity, that join condition is all it knows about how these two tables connect. Since SA never wants to assume things fit into an artificially constrained pattern, we can't "hardcode" the fact that it's really just a primary key/foreign key pair, as it can just as easily be a more complicated user-defined expression.
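
For illustration, here's a sketch of the same mapping with the join condition spelled out explicitly via the primaryjoin argument (normally SA derives it from the foreign keys; any expression could go here):

mapper(User, users, properties={
    'addresses': relation(Address,
        # the join condition, stated explicitly
        primaryjoin=users.c.id==addresses.c.user_id,
        backref='user')
})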

Clause adaption comes in when we start issuing joins between selectables which are derived from the base mapped tables. Suppose that when we execute the big addresses query and get back Address objects, we also want to get the User which maps to each address. We can of course construct this query manually, but we can also join() on the user relation as always, and clause adaption will take care of the fact that we are selecting from a big UNION and not the plain addresses table. Using compile() on the Query reveals the generated SQL:

>>> print sess.query(Address).select_from(aview).join('user').add_entity(User).compile()
SELECT
    anon_1.id AS anon_1_id, anon_1.email_address AS anon_1_email_address, anon_1.user_id AS anon_1_user_id,
    usertable.id AS usertable_id, usertable.name AS usertable_name
FROM
    (SELECT addresses.id AS id, addresses.email_address AS email_address, addresses.user_id AS user_id
    FROM addresses
    WHERE addresses.email_address LIKE ?

    UNION ALL

    SELECT addresses.id AS id, addresses.email_address AS email_address, addresses.user_id AS user_id
    FROM addresses JOIN
        (SELECT max(addresses.id) AS maxid FROM addresses GROUP BY addresses.user_id) AS max_ids ON addresses.id = max_ids.maxid
    WHERE NOT (EXISTS (
        SELECT 1 FROM addresses AS addresses_1 WHERE addresses.user_id = addresses_1.user_id AND addresses_1.email_address LIKE ?
    ))) AS anon_1
JOIN usertable ON usertable.id = anon_1.user_id ORDER BY anon_1.oid

The important part above is the join condition at the bottom: the join('user') call, which would normally produce addresses JOIN usertable ON usertable.id = addresses.user_id, was adapted against the aview selectable to produce the ON clause usertable.id = anon_1.user_id; additionally the ORDER BY, normally ORDER BY addresses.oid, was adapted to ORDER BY anon_1.oid.

Aliasing like this happens to a more significant degree when we combine eager loading with LIMIT/OFFSET, where the LIMITed query must be wrapped in a selectable which joins against the child table or tables without the limit applied:

>>> print sess.query(Address).options(eagerload_all('user.addresses')).limit(3).compile()
SELECT
    anon_1.addresses_id AS anon_1_addresses_id,
    anon_1.addresses_email_address AS anon_1_addresses_email_address,
    anon_1.addresses_user_id AS anon_1_addresses_user_id,
    addresses_1.id AS addresses_1_id,
    addresses_1.email_address AS addresses_1_email_address,
    addresses_1.user_id AS addresses_1_user_id,
    usertable_1.id AS usertable_1_id,
    usertable_1.name AS usertable_1_name
FROM (
    SELECT
        addresses.id AS addresses_id,
        addresses.email_address AS addresses_email_address,
        addresses.user_id AS addresses_user_id,
        addresses.oid AS addresses_oid
    FROM addresses ORDER BY addresses.oid LIMIT 3 OFFSET 0
    ) AS anon_1
    LEFT OUTER JOIN usertable AS usertable_1 ON usertable_1.id = anon_1.addresses_user_id
    LEFT OUTER JOIN addresses AS addresses_1 ON usertable_1.id = addresses_1.user_id
    ORDER BY anon_1.oid, usertable_1.oid, addresses_1.oid

To formulate the above query, Query applied these adaptions:

  • The left side of "usertable.id = addresses.user_id" was aliased to become "usertable_1.id = addresses.user_id", to join from the primary table to the users table.

  • The "usertable.id = addresses.user_id" was aliased in a separate instance to become "usertable_1.id = addresses_1.user_id", to join from the eagerly loaded users to the eagerly loaded child addresses.

  • and:

    addresses
      LEFT OUTER JOIN usertable AS usertable_1 ON usertable_1.id = addresses.user_id
      LEFT OUTER JOIN addresses AS addresses_1 ON usertable_1.id = addresses_1.user_id
    

    became (note the full text of anon_1, our LIMITed subquery, in place of addresses):

    (SELECT
    addresses.id AS addresses_id,
    addresses.email_address AS addresses_email_address,
    addresses.user_id AS addresses_user_id,
    addresses.oid AS addresses_oid
    FROM addresses ORDER BY addresses.oid LIMIT 3 OFFSET 0) AS anon_1
      LEFT OUTER JOIN usertable AS usertable_1 ON usertable_1.id = anon_1.addresses_user_id
      LEFT OUTER JOIN addresses AS addresses_1 ON usertable_1.id = addresses_1.user_id
    

    the final adaption adapted the right side of usertable_1.id=addresses.user_id as well as the left side of the larger join.

Column Reduction

Column reduction means that a collection of columns, such as the columns clause of a SELECT statement, is reduced to the smallest number of meaningful columns. Meaningful here means that no column in the collection has a foreign key relationship to any other column in the collection, nor does the selectable have any WHERE criterion that would constrain two columns to always have the same value. This function is used by the Join construct as well as by the ORM to construct the most minimal set of primary key columns for an expression.
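
As a sketch of the foreign-key half of that rule (not the real implementation): drop any column whose foreign key references another column that's also in the collection:

def reduce_columns(columns):
    result = list(columns)
    for col in columns:
        for fk in col.foreign_keys:
            # fk.column is the column this foreign key references
            if fk.column in result and col in result:
                result.remove(col)
    return result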

Consider the example of "people", "engineers" and "managers". This is an "inheriting" configuration where "engineers" and "managers" both join to the common information in the "people" table:

>>> people = Table('people', meta,
...    Column('person_id', Integer, primary_key=True),
...    Column('name', String(50)),
...    Column('type', String(30)))

>>> engineers = Table('engineers', meta,
...   Column('engineer_id', Integer, ForeignKey('people.person_id'), primary_key=True),
...   Column('engineer_name', String(50)),
...   Column('primary_language', String(50)),
...  )

>>> managers = Table('managers', meta,
...   Column('manager_id', Integer, ForeignKey('people.person_id'), primary_key=True),
...   Column('manager_name', String(50))
...   )

A join which selects all rows from all tables outerjoins both engineers and managers to people:

>>> all_people = people.outerjoin(engineers).outerjoin(managers)

If we select from this join, we get:

>>> print all_people.select()
SELECT
    people.person_id, people.name, people.type,
    engineers.engineer_id, engineers.engineer_name, engineers.primary_language,
    managers.manager_id, managers.manager_name
FROM people
    LEFT OUTER JOIN engineers ON people.person_id = engineers.engineer_id
    LEFT OUTER JOIN managers ON people.person_id = managers.manager_id

What's the most minimal primary key we can determine from the above join? Rolling up all primary key columns is a little too naive:

>>> print [str(c) for c in all_people.c if c.primary_key]
['people.person_id', 'engineers.engineer_id', 'managers.manager_id']

That's because engineer_id and manager_id are always going to have the same value as person_id (or be NULL). The difference determines whether we'd need to say query.get([2, None, 2]) to load person number 2, or just query.get([2]). (Note that we don't account for the idea that both engineers and managers might point to the same person_id here, because we are specifically talking about a "joined table inheritance" pattern, which does not support "multiple" inheritance.)

The reduce_columns() function steps through and gives us just what we want:

>>> from sqlalchemy.sql.util import reduce_columns
>>> print [str(c) for c in reduce_columns(
...    [col for col in all_people.c if col.primary_key])]
['people.person_id']

reduce_columns() can also do its work given a set of criteria or clauses to consult:

>>> s = select([engineers, managers]).\
...   where(engineers.c.engineer_name==managers.c.manager_name)
>>> print [str(c) for c in reduce_columns(list(s.c), s)]
['engineer_id', 'engineer_name', 'primary_language', 'manager_id']

Above, the manager_name column is removed, since the WHERE criterion of the select states that it's always equal to engineer_name.

There are a bunch of other functions, but these three took the longest to get right (and clause adaption, though it has improved massively, still has some ambiguous cases that need to be refined). So in future posts that talk about "adapting the join" or "translating rows", now you know what I mean.