Expression Transformations

January 23, 2008 at 11:12 PM | Code, SQLAlchemy

While the ORM side of SQLAlchemy is what most of our users focus on, SA has always intended to provide a broader set of tools than just that; referring to it as an "ORM" is like referring to the internet as "the web". Philosophically we aim to approach issues in a broad, "no-shortcuts" way that takes a long time to perfect but once it starts hitting the "sweet spot", a lot of capability starts falling out of it for free.

The expression language and the ORM's way of using it is one such example; the ORM considers everything in terms of SQL expressions which are all ultimately user-defined, and attempts to be as unconstrained as possible while at the same time being as informative as possible about what constraints are unavoidable. To support this, the expression language supports "transformative" behavior which allow an application, not just SA's ORM but any app, to deal with SQL generation in the abstract. That is, not just abstracted from the particular syntax of a particular database, but abstracted from the structure of the expressions themselves.

This is clearly a work in progress with caveats galore, but with each major version of SA the main ideas have become more refined and clear in their intent, starting with some murky implementations early on that could only be used in extremely specific operations, later evolving into core concepts from which most of the ORM's querying ability builds from. Here I want to introduce three concepts that are now commonplace throughout the SA core.

Column Correspondence

This is the ability to relate the columns in one SQL expression to another. Consider these two expressions; each ultimately select from the same table and return the same results:

SELECT id, name FROM usertable

SELECT ua.uid, ua.uname from (select id as uid, name as uname from usertable) AS ua

When executed, the cursor.description of each statement contains the names of the result columns. The first statement will have the names id and name, whereas the second will have the names uid and uname. Column correspondence allows us to match id to uid and name to uname. Starting with the table definition and expressions for these two selects:

>>> from sqlalchemy import *
>>> meta = MetaData()
>>> users = Table('usertable', meta,
...     Column('id', Integer, primary_key=True),
...     Column('name', String(50)),
...     )

>>> s1 = select([users.c.id.label('uid'), users.c.id.label('uname')])
>>> ua = s1.alias('ua')
>>> s2 = select([ua.c.uid, ua.c.uname])

The columns in the two selects can be related to the base table as well as each other using corresponding_column():

>>> print users.corresponding_column(s2.c.uid)
usertable.id
>>> print s2.corresponding_column(s1.c.uname)
uname

With a so-called "composite" expression that contains multiple SELECT statements, the behavior is a little more intricate, as in the UNION below which takes two different tables and melds them together:

SELECT id AS uid, name AS uname FROM (
    SELECT id, name FROM usertable WHERE name='jack'

    UNION ALL

    SELECT id, email_address FROM addresses WHERE id=12
)

The representation in expression form, adding in the addresses table to start:

>>> addresses = Table('addresses', meta,
...     Column('id', Integer, primary_key=True),
...     Column('email_address', String(100)),
...     Column('user_id', Integer, ForeignKey('usertable.id')))

>>> s1 = users.select().where(users.c.name=='jack')
>>> s2 = select([addresses.c.id, addresses.c.email_address]).where(addresses.c.id==12)
>>> u = union_all(s1, s2)
>>> s3 = select([u.c.id.label('uid'), u.c.name.label('uname')])

Now, what column does s3.c.uid correspond to ? In the above union, its derived from both users.c.id and addresses.c.id All the paths you'd expect work:

>>> print addresses.corresponding_column(s3.c.uid)
addresses.id
>>> print users.corresponding_column(s3.c.uid)
usertable.id
>>> print addresses.corresponding_column(s3.c.uname)
addresses.email_address
>>> print s3.corresponding_column(users.c.id)
uid

The corresponding_column() method makes usage of a set associated with every column called proxy_set. Whenever a column is derived from another, it's given a collection called proxies which contains the originating column, or in the case of a union, all originators. proxy_set is then created on-demand and is the union of the current column's proxies collection with the proxies collection of each column in that collection, recursively up to the original column; basically a linked list. To see if the paths of two columns c1 and c2 overlap, we check if the intersection of c1.proxy_set and c2.proxy_set is non-empty. Multiple hits can be resolved by taking the largest intersection.

Column correspondence is used all over the place in SQLAlchemy. One way it's used is to adapt result rows from one selectable to another. Suppose we plugged into a test database, and created the standard "adjacency list" table nodes:

>>> meta.bind = create_engine('sqlite://')
>>> nodes = Table('nodes', meta,
...     Column('node_id', Integer, primary_key=True),
...     Column('parent_id', Integer, ForeignKey('nodes.node_id')),
...     Column('data', String(50)))
>>> nodes.create()

Then, we want to load all nodes, as well as their children and grandchildren, using outer joins. To join a table to itself requires aliases, such as this SQL:

SELECT nodes.node_id, nodes.parent_id, nodes.data,
       n2.node_id, n2.parent_id, n2.data,
       n3.node_id, n3.parent_id, n3.data
FROM
       nodes
       LEFT OUTER JOIN nodes AS n2 ON nodes.node_id=n2.parent_id
       LEFT OUTER JOIN nodes AS n3 ON n2.node_id=n3.parent_id

Then suppose we want a function that can print out the attributes of a "node" for us, given a result row. If we were just selecting three columns, we could do this:

>>> def print_node(row):
...    print "Id:", row[0], "Parent id:", row[1], "Data:", row[2]

We can see that approach won't work for our query above, as we need to locate three sets of columns at different positions. While we could try passing in numerical clues as to where the columns we want are, or rely upon naming schemes, SQLAlchemy most cleanly abstracts away column targeting by allowing you to use the actual Column objects which comprise the expression's "columns" clause. Building our query as an expression:

>>> n2 = nodes.alias('n2')
>>> n3 = nodes.alias('n3')
>>> all_nodes = nodes.outerjoin(n2, nodes.c.node_id==n2.c.parent_id).\
...     outerjoin(n3, n2.c.node_id==n3.c.parent_id).select(use_labels=True)

The use_labels sets up unique names for each column, which is a current "implementation detail"; SQLAlchemy targets columns ultimately based on string names, so they need to be unique. While positional targeting would remove this detail, concerns over free-text expressions which might throw off the column count have so far prevented this switch.

We would want to define our print_node as below, referencing only the node table. The function has no awareness of the n2 and n3 alias constructs:

>>> def print_node(row):
...    print "Id:", row[nodes.c.node_id], \
...    "Parent id:", row[nodes.c.parent_id], \
...    "Data:", row[nodes.c.data]

To differentiate among columns that correspond to nodes, n2, or n3, SQLAlchemy uses a function built on top of corresponding_column called row_adapter, which returns a callable that can translate a row from one selectable to another. So usage would look like this:

>>> from sqlalchemy.sql.util import row_adapter
>>> n2_adapter = row_adapter(n2, nodes)
>>> n3_adapter = row_adapter(n3, nodes)
>>> for row in all_nodes.execute():
...    print_node(row)
...    print_node(n2_adapter(row))
...    print_node(n3_adapter(row))

print_node is used three times with the same row, each time given a different interpretation of that row provided by the adapter. What row_adapter() does here is provide a decoupling layer between code which knows how a selectable was constructed and code which consumes result rows from that selectable.

A more wacky example. Suppose we have a regular mapping between User and Address, with a one-to-many addresses matched by a many-to-one users. This is the same example in the SQLAlchemy ORM tutorial and dozens of unit tests:

>>> from sqlalchemy.orm import *
>>> class User(object):pass
>>> class Address(object):pass
>>> mapper(User, users, properties={'addresses':relation(Address, backref='user')})
>>> mapper(Address, addresses)

So the DBAs send you a large UNION query that returns for you one address record per user id represented in the addresses table; either the address row that has the domain "@blooger.biz", or if no such address exists for that user, the address row with the highest primary key value. The query below accomplishes this by selecting first all the "blooger.biz" rows, then unioning that result with the "max id" result which explicitly omits the "blogger.biz" rows:

SELECT addresses.id, addresses.user_id, addresses.email_address
FROM addresses WHERE
addresses.email_address.like('%@blooger.biz')

UNION ALL

SELECT addresses.id, addresses.user_id, addresses.email_address
FROM addresses JOIN
(SELECT MAX(addresses.id) AS maxid
  FROM addresses GROUP BY addresses.user_id) AS max_ids
ON addresses.id=max_ids.maxid
WHERE NOT EXISTS(SELECT 1 FROM addresses WHERE
       user_id=addresses.user_id AND
       addresses.email_address.like('%@blooger.biz'))

How do we "drop in" this select statement into an ORM query with a minimum of headache ? We could certainly work from the string representation above directly, just say query.from_statement(<text>) and be done with it. But we're going to want to do more with this query in the next section, where the Query will be able to intelligently manipulate the expression further, so for that we build a SQL construct out of it. We build such an expression the same way any DBA would design the above SQL directly, working from the inside out:

>>> max_ids = select([func.max(addresses.c.id).label('maxid')]).\
...       group_by(addresses.c.user_id).alias('max_ids')
>>> aalias = addresses.alias()
>>> aview = union_all(
...  addresses.select().where(addresses.c.email_address.endswith('@blooger.biz')),
...  addresses.select().\
...    select_from(addresses.join(max_ids, addresses.c.id==max_ids.c.maxid)).\
...    where(
...    ~exists([1], and_(
...        addresses.c.user_id==aalias.c.user_id,
...        aalias.c.email_address.endswith('@blooger.biz')))
...    )
... )

With our statement above, we can drop it into the select_from() method on Query, which is a little bit like from_statement() except the Query will be ready to manipulate the incoming expression as though it were the primary mapped table:

>>> sess = create_session()
>>> sess.query(Address).select_from(aview)

When the above query fetches rows, the "row adaption" we discussed in the last section is applied to each incoming row, mapping the aview selectable back to the addresses table. This way, our Address mapper, which still only knows about the addresses table, can target columns against the addresses table when they are in fact adapted from columns targeted against the aliased aview selectable. In a concrete sense it means that result columns which are labeled (or if we were doing positional column targeting, positioned) corresponding to Column objects present in the aview construct can referenced with Column objects from the addresses table. When joins and eager loads are added on to this query, its very likely that the same row will be translated using many row adapters at result time, often translating different columns into the same underlying table, as is the case in our nodes example.

Clause Adaption

Clause adaption builds upon the idea of corresponding_column() to replace parts of an expression which "correspond" to another expression to that of the other expression. Originally this operation supported columns only and only worked with simplistic column-oriented expressions, but in recent months it's been expanded to support the replacement of pretty much any kind of expression, and has moved from an outerlying edge case into a core concept which provides the ability to query from "polymorphic" selectables (which select for several different mappers encapsulated in a subquery), to create aliased joins, as well as to allow joins to or from any arbitrary selectable; all generated strictly on the known relationships between the "underlying" tables.

Using our nodes example, we can apply filtering criterion constructed from the nodes table object and apply it towards the aliases of the nodes table. Such as, to load all nodes who have a grandchild containing the data field "crackers"; the desired SQL is:

SELECT nodes.node_id, nodes.parent_id, nodes.data
FROM nodes JOIN nodes as children ON nodes.node_id=children.parent_id
JOIN nodes as grandchildren ON children.node_id=grandchildren.parent_id
WHERE grandchildren.data="crackers"

If we were given the SQL construct:

>>> children = nodes.alias('children')
>>> grandchildren = nodes.alias('grandchildren')
>>> n = nodes.select().\
...     select_from(nodes.join(children, nodes.c.node_id==children.c.parent_id).\
...     join(grandchildren, children.c.node_id==grandchildren.c.parent_id))

we can apply the criterion nodes.c.data=='crackers', constructed without knowledge of the grandchildren alias, to that alias using adaption:

>>> from sqlalchemy.sql.util import ClauseAdapter
>>> criterion = nodes.c.data=='crackers'

>>> print criterion
nodes.data = :nodes_data

>>> adapted = ClauseAdapter(grandchildren).traverse(criterion)
>>> print adapted
grandchildren.data = :nodes_data

>>> print n.where(adapted)
SELECT nodes.node_id, nodes.parent_id, nodes.data
FROM nodes JOIN nodes AS children ON nodes.node_id = children.parent_id
JOIN nodes AS grandchildren ON children.node_id = grandchildren.parent_id
WHERE grandchildren.data = ?

Above, the ClauseAdapter object is created against the grandchildren alias. When it's given a clause to traverse(), it produces a copy of that clause with all elements that "correspond" to grandchildren replaced with the corresponding element, in this case the nodes.data column is replaced with grandchildren.data.

The above process is the same thing that happens automatically if you had a mapping setup for nodes, and used aliased joins as follows:

session.query(Node).join('children', 'children', aliased=True).\
   filter(Node.data=='crackers')

Let's take a look at how adaption applies to our users and addresses example. Those familiar with SQLAlchemy relations know that the "join condition" between the users and addresses tables is users.c.id==addresses.c.user_id. When we work with these mappers, anytime SA joins between a User and Address entity, that join condition is all it knows about how these two tables connect. Since SA never wants to assume things fit into an artificially constrained pattern, we can't "hardcode" to the fact that its really just a primary key/foreign key pair, as it can just as easily be a more complicated user-defined expression.

Clause adaption comes in when we start issuing joins between selectables which are derived from the base mapped tables. Suppose that when we execute the big addresses query and get back Address objects, we also want to get the User which maps to each address. We can of course construct this query manually, but we can also join() on the user relation as always, and clause adaption will take care of the fact that we are selecting from a big UNION and not the plain addresses table. Using compile() on Query to see the generated SQL reveals:

>>> print sess.query(Address).select_from(aview).join('user').add_entity(User).compile()
#!sql
SELECT
    anon_1.id AS anon_1_id, anon_1.email_address AS anon_1_email_address, anon_1.user_id AS anon_1_user_id,
    usertable.id AS usertable_id, usertable.name AS usertable_name
FROM
    (SELECT addresses.id AS id, addresses.email_address AS email_address, addresses.user_id AS user_id
    FROM addresses
    WHERE addresses.email_address LIKE ?

    UNION ALL

    SELECT addresses.id AS id, addresses.email_address AS email_address, addresses.user_id AS user_id
    FROM addresses JOIN
        (SELECT max(addresses.id) AS maxid FROM addresses GROUP BY addresses.user_id) AS max_ids ON addresses.id = max_ids.maxid
    WHERE NOT (EXISTS (
        SELECT 1 FROM addresses AS addresses_1 WHERE addresses.user_id = addresses_1.user_id AND addresses_1.email_address LIKE ?
    ))) AS anon_1
JOIN usertable ON usertable.id = anon_1.user_id ORDER BY anon_1.oid

The important part above is the join condition at the bottom; the join('user') call which would normally produce addresses JOIN usertable ON usertable.id = addresses.user_id was adapted against the aview selectable to produce the ON clause of usertable.id = anon_1.user_id; additionally the ORDER BY, normally ORDER BY addresses.oid was aliased to be ORDER BY anon_1.oid.

Aliasing like this happens to a more significant degree when we combine eager loading with LIMIT/OFFSET, where the LIMITED query must be wrapped in a selectable which joins against the child table or tables without the limit applied:

>>> print sess.query(Address).options(eagerload_all('user.addresses')).limit(3).compile()
#!sql
SELECT
    anon_1.addresses_id AS anon_1_addresses_id,
    anon_1.addresses_email_address AS anon_1_addresses_email_address,
    anon_1.addresses_user_id AS anon_1_addresses_user_id,
    addresses_1.id AS addresses_1_id,
    addresses_1.email_address AS addresses_1_email_address,
    addresses_1.user_id AS addresses_1_user_id,
    usertable_1.id AS usertable_1_id,
    usertable_1.name AS usertable_1_name
FROM (
    SELECT
        addresses.id AS addresses_id,
        addresses.email_address AS addresses_email_address,
        addresses.user_id AS addresses_user_id,
        addresses.oid AS addresses_oid
    FROM addresses ORDER BY addresses.oid LIMIT 3 OFFSET 0
    ) AS anon_1
    LEFT OUTER JOIN usertable AS usertable_1 ON usertable_1.id = anon_1.addresses_user_id
    LEFT OUTER JOIN addresses AS addresses_1 ON usertable_1.id = addresses_1.user_id
    ORDER BY anon_1.oid, usertable_1.oid, addresses_1.oid

To formulate the above query, Query applied these adaptions:

  • The left side of "usertable.id = addresses.user_id" was aliased to become "usertable_1.id = addresses.user_id", to join from the primary table to the users table.

  • The "usertable.id = addresses.user_id" was aliased in a separate instance to become "usertable_1.id = addresses_1.user_id", to join from the eagerly loaded users to the eagerly loaded child addresses.

  • and:

    addresses
      LEFT OUTER JOIN usertable AS usertable_1 ON usertable_1.id = addresses.user_id
      LEFT OUTER JOIN addresses AS addresses_1 ON usertable_1.id = addresses_1.user_id
    

    became (note the full text of anon_1, our LIMITed subquery, in place of addresses):

    (SELECT
    addresses.id AS addresses_id,
    addresses.email_address AS addresses_email_address,
    addresses.user_id AS addresses_user_id,
    addresses.oid AS addresses_oid
    FROM addresses ORDER BY addresses.oid LIMIT 3 OFFSET 0) AS anon_1
      LEFT OUTER JOIN usertable AS usertable_1 ON usertable_1.id = anon_1.addresses_user_id
      LEFT OUTER JOIN addresses AS addresses_1 ON usertable_1.id = addresses_1.user_id
    

    the final adaption adapted the right side of usertable_1.id=addresses.user_id as well as the left side of the larger join.

Column Reduction

Column reduction means that a collection of columns, such as in the columns clause of a SELECT statement, are reduced to the smallest number of meaningful columns. Meaningful here means that no column in the collection has a foreign key relationship to any other column in the collection, nor does the selectable have any WHERE criterion that would relate the two columns to be always the same value. This function is used by the Join construct as well as the ORM to construct the most minimal set of primary key columns for an expression.

Consider the example of "people", "engineers" and "managers". This is an "inheriting" configuration where "engineers" and "managers" both join to the common information in the "people" table:

>>> people = Table('people', meta,
...    Column('person_id', Integer, primary_key=True),
...    Column('name', String(50)),
...    Column('type', String(30)))

>>> engineers = Table('engineers', meta,
...   Column('engineer_id', Integer, ForeignKey('people.person_id'), primary_key=True),
...   Column('engineer_name', String(50)),
...   Column('primary_language', String(50)),
...  )

>>> managers = Table('managers', meta,
...   Column('manager_id', Integer, ForeignKey('people.person_id'), primary_key=True),
...   Column('manager_name', String(50))
...   )

A join which selects all rows from all tables outerjoins both engineers and managers to people:

>>> all_people = people.outerjoin(engineers).outerjoin(managers)

If we select from this join, we get:

>>> print all_people.select()
#!sql
SELECT
    people.person_id, people.name, people.type,
    engineers.engineer_id, engineers.engineer_name, engineers.primary_language,
    managers.manager_id, managers.manager_name
FROM people
    LEFT OUTER JOIN engineers ON people.person_id = engineers.engineer_id
    LEFT OUTER JOIN managers ON people.person_id = managers.manager_id

What's the most minimal primary key we can determine from the above join ? Rolling up all primary key columns is a little too naive:

>>> print [str(c) for c in all_people.c if c.primary_key]
['people.person_id', 'engineers.engineer_id', 'managers.manager_id']

Because engineer_id and manager_id are always going to have the same value as person_id (or be NULL). The difference here determines whether we'd need to say query.get([2,None,2]) to load person number 2, or just query.get([2]) (note that we don't account for the idea that both engineers and managers might point to the same person_id here, because we are specifically talking about a "joined table inheritance" pattern which does not support "multiple" inheritance).

The reduce_columns() function steps through and gives us just what we want:

>>> from sqlalchemy.sql.util import reduce_columns
>>> print [str(c) for c in reduce_columns(
...    [col for col in all_people.c if col.primary_key])]
['people.person_id']

reduce_columns() can also do its work given a set of criterion or clauses to use:

>>> s = select([engineers, managers]).\
...   where(engineers.c.engineer_name==managers.c.manager_name)
>>> print [str(c) for c in reduce_columns(list(s.c), s)]
['engineer_id', 'engineer_name', 'primary_language', 'manager_id']

Above, the manager_name column is removed since the WHERE criterion of the select states that they're equivalent.

There's a bunch of other functions, but these three took the longest time to get right (and clause adaption, which has improved massively, still has some ambiguous cases that need to be refined). So in future posts which talk about "adapting the join" or "translating rows", now you know what I mean.