While the ORM side of SQLAlchemy is what most of our users focus
on, SA has always intended to provide a broader set of tools
than just that; referring to it as an "ORM" is like referring to
the internet as "the web". Philosophically we aim to approach
issues in a broad, "no-shortcuts" way that takes a long time to
perfect but once it starts hitting the "sweet spot", a lot of
capability starts falling out of it for free.
The expression language and the ORM's way of using it is one
such example; the ORM considers everything in terms of SQL
expressions which are all ultimately user-defined, and attempts
to be as unconstrained as possible while at the same time being
as informative as possible about what constraints are
unavoidable. To support this, the expression language supports
"transformative" behavior which allow an application, not just
SA's ORM but any app, to deal with SQL generation in the
abstract. That is, not just abstracted from the particular
syntax of a particular database, but abstracted from the
structure of the expressions themselves.
This is clearly a work in progress with caveats galore, but with
each major version of SA the main ideas have become more refined
and clear in their intent, starting with some murky
implementations early on that could only be used in extremely
specific operations, later evolving into core concepts from
which most of the ORM's querying ability builds from. Here I
want to introduce three concepts that are now commonplace
throughout the SA core.
Column Correspondence
This is the ability to relate the columns in one SQL expression
to another. Consider these two expressions; each ultimately
select from the same table and return the same results:
SELECT id, name FROM usertable
SELECT ua.uid, ua.uname from (select id as uid, name as uname from usertable) AS ua
When executed, the cursor.description of each statement
contains the names of the result columns. The first statement
will have the names id and name, whereas the second will
have the names uid and uname. Column correspondence allows
us to match id to uid and name to uname. Starting with
the table definition and expressions for these two selects:
>>> from sqlalchemy import *
>>> meta = MetaData()
>>> users = Table('usertable', meta,
... Column('id', Integer, primary_key=True),
... Column('name', String(50)),
... )
>>> s1 = select([users.c.id.label('uid'), users.c.id.label('uname')])
>>> ua = s1.alias('ua')
>>> s2 = select([ua.c.uid, ua.c.uname])
The columns in the two selects can be related to the base table
as well as each other using corresponding_column():
>>> print users.corresponding_column(s2.c.uid)
usertable.id
>>> print s2.corresponding_column(s1.c.uname)
uname
With a so-called "composite" expression that contains multiple
SELECT statements, the behavior is a little more intricate, as
in the UNION below which takes two different tables and melds
them together:
SELECT id AS uid, name AS uname FROM (
SELECT id, name FROM usertable WHERE name='jack'
UNION ALL
SELECT id, email_address FROM addresses WHERE id=12
)
The representation in expression form, adding in the addresses
table to start:
>>> addresses = Table('addresses', meta,
... Column('id', Integer, primary_key=True),
... Column('email_address', String(100)),
... Column('user_id', Integer, ForeignKey('usertable.id')))
>>> s1 = users.select().where(users.c.name=='jack')
>>> s2 = select([addresses.c.id, addresses.c.email_address]).where(addresses.c.id==12)
>>> u = union_all(s1, s2)
>>> s3 = select([u.c.id.label('uid'), u.c.name.label('uname')])
Now, what column does s3.c.uid correspond to ? In the above
union, its derived from both users.c.id and addresses.c.id
All the paths you'd expect work:
>>> print addresses.corresponding_column(s3.c.uid)
addresses.id
>>> print users.corresponding_column(s3.c.uid)
usertable.id
>>> print addresses.corresponding_column(s3.c.uname)
addresses.email_address
>>> print s3.corresponding_column(users.c.id)
uid
The corresponding_column() method makes usage of a set
associated with every column called proxy_set. Whenever a
column is derived from another, it's given a collection called
proxies which contains the originating column, or in the case
of a union, all originators. proxy_set is then created
on-demand and is the union of the current column's proxies
collection with the proxies collection of each column in that
collection, recursively up to the original column; basically a
linked list. To see if the paths of two columns c1 and c2
overlap, we check if the intersection of c1.proxy_set and
c2.proxy_set is non-empty. Multiple hits can be resolved by
taking the largest intersection.
Column correspondence is used all over the place in SQLAlchemy.
One way it's used is to adapt result rows from one selectable to
another. Suppose we plugged into a test database, and created
the standard "adjacency list" table nodes:
>>> meta.bind = create_engine('sqlite://')
>>> nodes = Table('nodes', meta,
... Column('node_id', Integer, primary_key=True),
... Column('parent_id', Integer, ForeignKey('nodes.node_id')),
... Column('data', String(50)))
>>> nodes.create()
Then, we want to load all nodes, as well as their children and
grandchildren, using outer joins. To join a table to itself
requires aliases, such as this SQL:
SELECT nodes.node_id, nodes.parent_id, nodes.data,
n2.node_id, n2.parent_id, n2.data,
n3.node_id, n3.parent_id, n3.data
FROM
nodes
LEFT OUTER JOIN nodes AS n2 ON nodes.node_id=n2.parent_id
LEFT OUTER JOIN nodes AS n3 ON n2.node_id=n3.parent_id
Then suppose we want a function that can print out the
attributes of a "node" for us, given a result row. If we were
just selecting three columns, we could do this:
>>> def print_node(row):
... print "Id:", row[0], "Parent id:", row[1], "Data:", row[2]
We can see that approach won't work for our query above, as we
need to locate three sets of columns at different positions.
While we could try passing in numerical clues as to where the
columns we want are, or rely upon naming schemes, SQLAlchemy
most cleanly abstracts away column targeting by allowing you to
use the actual Column objects which comprise the expression's
"columns" clause. Building our query as an expression:
>>> n2 = nodes.alias('n2')
>>> n3 = nodes.alias('n3')
>>> all_nodes = nodes.outerjoin(n2, nodes.c.node_id==n2.c.parent_id).\
... outerjoin(n3, n2.c.node_id==n3.c.parent_id).select(use_labels=True)
The use_labels sets up unique names for each column, which is
a current "implementation detail"; SQLAlchemy targets columns
ultimately based on string names, so they need to be unique.
While positional targeting would remove this detail, concerns
over free-text expressions which might throw off the column
count have so far prevented this switch.
We would want to define our print_node as below, referencing
only the node table. The function has no awareness of the n2
and n3 alias constructs:
>>> def print_node(row):
... print "Id:", row[nodes.c.node_id], \
... "Parent id:", row[nodes.c.parent_id], \
... "Data:", row[nodes.c.data]
To differentiate among columns that correspond to nodes, n2,
or n3, SQLAlchemy uses a function built on top of
corresponding_column called row_adapter, which returns a
callable that can translate a row from one selectable to
another. So usage would look like this:
>>> from sqlalchemy.sql.util import row_adapter
>>> n2_adapter = row_adapter(n2, nodes)
>>> n3_adapter = row_adapter(n3, nodes)
>>> for row in all_nodes.execute():
... print_node(row)
... print_node(n2_adapter(row))
... print_node(n3_adapter(row))
print_node is used three times with the same row, each time
given a different interpretation of that row provided by the
adapter. What row_adapter() does here is provide a decoupling
layer between code which knows how a selectable was constructed
and code which consumes result rows from that selectable.
A more wacky example. Suppose we have a regular mapping between
User and Address, with a one-to-many addresses matched by
a many-to-one users. This is the same example in the
SQLAlchemy ORM tutorial and dozens of unit tests:
>>> from sqlalchemy.orm import *
>>> class User(object):pass
>>> class Address(object):pass
>>> mapper(User, users, properties={'addresses':relation(Address, backref='user')})
>>> mapper(Address, addresses)
So the DBAs send you a large UNION query that returns for you
one address record per user id represented in the addresses
table; either the address row that has the domain
"@blooger.biz", or if no such address exists for that user, the
address row with the highest primary key value. The query below
accomplishes this by selecting first all the "blooger.biz" rows,
then unioning that result with the "max id" result which
explicitly omits the "blogger.biz" rows:
SELECT addresses.id, addresses.user_id, addresses.email_address
FROM addresses WHERE
addresses.email_address.like('%@blooger.biz')
UNION ALL
SELECT addresses.id, addresses.user_id, addresses.email_address
FROM addresses JOIN
(SELECT MAX(addresses.id) AS maxid
FROM addresses GROUP BY addresses.user_id) AS max_ids
ON addresses.id=max_ids.maxid
WHERE NOT EXISTS(SELECT 1 FROM addresses WHERE
user_id=addresses.user_id AND
addresses.email_address.like('%@blooger.biz'))
How do we "drop in" this select statement into an ORM query with
a minimum of headache ? We could certainly work from the string
representation above directly, just say
query.from_statement(<text>) and be done with it. But we're
going to want to do more with this query in the next section,
where the Query will be able to intelligently manipulate the
expression further, so for that we build a SQL construct out of
it. We build such an expression the same way any DBA would
design the above SQL directly, working from the inside out:
>>> max_ids = select([func.max(addresses.c.id).label('maxid')]).\
... group_by(addresses.c.user_id).alias('max_ids')
>>> aalias = addresses.alias()
>>> aview = union_all(
... addresses.select().where(addresses.c.email_address.endswith('@blooger.biz')),
... addresses.select().\
... select_from(addresses.join(max_ids, addresses.c.id==max_ids.c.maxid)).\
... where(
... ~exists([1], and_(
... addresses.c.user_id==aalias.c.user_id,
... aalias.c.email_address.endswith('@blooger.biz')))
... )
... )
With our statement above, we can drop it into the
select_from() method on Query, which is a little bit like
from_statement() except the Query will be ready to
manipulate the incoming expression as though it were the primary
mapped table:
>>> sess = create_session()
>>> sess.query(Address).select_from(aview)
When the above query fetches rows, the "row adaption" we
discussed in the last section is applied to each incoming row,
mapping the aview selectable back to the addresses table.
This way, our Address mapper, which still only knows about the
addresses table, can target columns against the addresses
table when they are in fact adapted from columns targeted
against the aliased aview selectable. In a concrete sense it
means that result columns which are labeled (or if we were doing
positional column targeting, positioned) corresponding to Column
objects present in the aview construct can referenced with
Column objects from the addresses table. When joins and eager
loads are added on to this query, its very likely that the same
row will be translated using many row adapters at result time,
often translating different columns into the same underlying
table, as is the case in our nodes example.
Clause Adaption
Clause adaption builds upon the idea of corresponding_column()
to replace parts of an expression which "correspond" to another
expression to that of the other expression. Originally this
operation supported columns only and only worked with simplistic
column-oriented expressions, but in recent months it's been
expanded to support the replacement of pretty much any kind of
expression, and has moved from an outerlying edge case into a
core concept which provides the ability to query from
"polymorphic" selectables (which select for several different
mappers encapsulated in a subquery), to create aliased joins, as
well as to allow joins to or from any arbitrary selectable; all
generated strictly on the known relationships between the
"underlying" tables.
Using our nodes example, we can apply filtering criterion
constructed from the nodes table object and apply it towards
the aliases of the nodes table. Such as, to load all nodes who
have a grandchild containing the data field "crackers"; the
desired SQL is:
SELECT nodes.node_id, nodes.parent_id, nodes.data
FROM nodes JOIN nodes as children ON nodes.node_id=children.parent_id
JOIN nodes as grandchildren ON children.node_id=grandchildren.parent_id
WHERE grandchildren.data="crackers"
If we were given the SQL construct:
>>> children = nodes.alias('children')
>>> grandchildren = nodes.alias('grandchildren')
>>> n = nodes.select().\
... select_from(nodes.join(children, nodes.c.node_id==children.c.parent_id).\
... join(grandchildren, children.c.node_id==grandchildren.c.parent_id))
we can apply the criterion nodes.c.data=='crackers',
constructed without knowledge of the grandchildren alias, to
that alias using adaption:
>>> from sqlalchemy.sql.util import ClauseAdapter
>>> criterion = nodes.c.data=='crackers'
>>> print criterion
nodes.data = :nodes_data
>>> adapted = ClauseAdapter(grandchildren).traverse(criterion)
>>> print adapted
grandchildren.data = :nodes_data
>>> print n.where(adapted)
SELECT nodes.node_id, nodes.parent_id, nodes.data
FROM nodes JOIN nodes AS children ON nodes.node_id = children.parent_id
JOIN nodes AS grandchildren ON children.node_id = grandchildren.parent_id
WHERE grandchildren.data = ?
Above, the ClauseAdapter object is created against the
grandchildren alias. When it's given a clause to traverse(),
it produces a copy of that clause with all elements that
"correspond" to grandchildren replaced with the corresponding
element, in this case the nodes.data column is replaced with
grandchildren.data.
The above process is the same thing that happens automatically
if you had a mapping setup for nodes, and used aliased joins
as follows:
session.query(Node).join('children', 'children', aliased=True).\
filter(Node.data=='crackers')
Let's take a look at how adaption applies to our users and
addresses example. Those familiar with SQLAlchemy relations
know that the "join condition" between the users and
addresses tables is users.c.id==addresses.c.user_id. When we
work with these mappers, anytime SA joins between a User and
Address entity, that join condition is all it knows about how
these two tables connect. Since SA never wants to assume things
fit into an artificially constrained pattern, we can't
"hardcode" to the fact that its really just a primary
key/foreign key pair, as it can just as easily be a more
complicated user-defined expression.
Clause adaption comes in when we start issuing joins between
selectables which are derived from the base mapped tables.
Suppose that when we execute the big addresses query and get
back Address objects, we also want to get the User which
maps to each address. We can of course construct this query
manually, but we can also join() on the user relation as
always, and clause adaption will take care of the fact that we
are selecting from a big UNION and not the plain addresses
table. Using compile() on Query to see the generated SQL
reveals:
>>> print sess.query(Address).select_from(aview).join('user').add_entity(User).compile()
#!sql
SELECT
anon_1.id AS anon_1_id, anon_1.email_address AS anon_1_email_address, anon_1.user_id AS anon_1_user_id,
usertable.id AS usertable_id, usertable.name AS usertable_name
FROM
(SELECT addresses.id AS id, addresses.email_address AS email_address, addresses.user_id AS user_id
FROM addresses
WHERE addresses.email_address LIKE ?
UNION ALL
SELECT addresses.id AS id, addresses.email_address AS email_address, addresses.user_id AS user_id
FROM addresses JOIN
(SELECT max(addresses.id) AS maxid FROM addresses GROUP BY addresses.user_id) AS max_ids ON addresses.id = max_ids.maxid
WHERE NOT (EXISTS (
SELECT 1 FROM addresses AS addresses_1 WHERE addresses.user_id = addresses_1.user_id AND addresses_1.email_address LIKE ?
))) AS anon_1
JOIN usertable ON usertable.id = anon_1.user_id ORDER BY anon_1.oid
The important part above is the join condition at the bottom;
the join('user') call which would normally produce addresses
JOIN usertable ON usertable.id = addresses.user_id was adapted
against the aview selectable to produce the ON clause of
usertable.id = anon_1.user_id; additionally the ORDER BY,
normally ORDER BY addresses.oid was aliased to be ORDER BY
anon_1.oid.
Aliasing like this happens to a more significant degree when we
combine eager loading with LIMIT/OFFSET, where the LIMITED query
must be wrapped in a selectable which joins against the child
table or tables without the limit applied:
>>> print sess.query(Address).options(eagerload_all('user.addresses')).limit(3).compile()
#!sql
SELECT
anon_1.addresses_id AS anon_1_addresses_id,
anon_1.addresses_email_address AS anon_1_addresses_email_address,
anon_1.addresses_user_id AS anon_1_addresses_user_id,
addresses_1.id AS addresses_1_id,
addresses_1.email_address AS addresses_1_email_address,
addresses_1.user_id AS addresses_1_user_id,
usertable_1.id AS usertable_1_id,
usertable_1.name AS usertable_1_name
FROM (
SELECT
addresses.id AS addresses_id,
addresses.email_address AS addresses_email_address,
addresses.user_id AS addresses_user_id,
addresses.oid AS addresses_oid
FROM addresses ORDER BY addresses.oid LIMIT 3 OFFSET 0
) AS anon_1
LEFT OUTER JOIN usertable AS usertable_1 ON usertable_1.id = anon_1.addresses_user_id
LEFT OUTER JOIN addresses AS addresses_1 ON usertable_1.id = addresses_1.user_id
ORDER BY anon_1.oid, usertable_1.oid, addresses_1.oid
To formulate the above query, Query applied these adaptions:
The left side of "usertable.id = addresses.user_id" was
aliased to become "usertable_1.id = addresses.user_id", to
join from the primary table to the users table.
The "usertable.id = addresses.user_id" was aliased in a
separate instance to become "usertable_1.id =
addresses_1.user_id", to join from the eagerly loaded users
to the eagerly loaded child addresses.
and:
addresses
LEFT OUTER JOIN usertable AS usertable_1 ON usertable_1.id = addresses.user_id
LEFT OUTER JOIN addresses AS addresses_1 ON usertable_1.id = addresses_1.user_id
became (note the full text of anon_1, our LIMITed subquery, in place of addresses):
(SELECT
addresses.id AS addresses_id,
addresses.email_address AS addresses_email_address,
addresses.user_id AS addresses_user_id,
addresses.oid AS addresses_oid
FROM addresses ORDER BY addresses.oid LIMIT 3 OFFSET 0) AS anon_1
LEFT OUTER JOIN usertable AS usertable_1 ON usertable_1.id = anon_1.addresses_user_id
LEFT OUTER JOIN addresses AS addresses_1 ON usertable_1.id = addresses_1.user_id
the final adaption adapted the right side of
usertable_1.id=addresses.user_id as well as the left side of
the larger join.
Column Reduction
Column reduction means that a collection of columns, such as in
the columns clause of a SELECT statement, are reduced to the
smallest number of meaningful columns. Meaningful here means
that no column in the collection has a foreign key relationship
to any other column in the collection, nor does the selectable
have any WHERE criterion that would relate the two columns to be
always the same value. This function is used by the Join
construct as well as the ORM to construct the most minimal set
of primary key columns for an expression.
Consider the example of "people", "engineers" and "managers".
This is an "inheriting" configuration where "engineers" and
"managers" both join to the common information in the "people"
table:
>>> people = Table('people', meta,
... Column('person_id', Integer, primary_key=True),
... Column('name', String(50)),
... Column('type', String(30)))
>>> engineers = Table('engineers', meta,
... Column('engineer_id', Integer, ForeignKey('people.person_id'), primary_key=True),
... Column('engineer_name', String(50)),
... Column('primary_language', String(50)),
... )
>>> managers = Table('managers', meta,
... Column('manager_id', Integer, ForeignKey('people.person_id'), primary_key=True),
... Column('manager_name', String(50))
... )
A join which selects all rows from all tables outerjoins both
engineers and managers to people:
>>> all_people = people.outerjoin(engineers).outerjoin(managers)
If we select from this join, we get:
>>> print all_people.select()
#!sql
SELECT
people.person_id, people.name, people.type,
engineers.engineer_id, engineers.engineer_name, engineers.primary_language,
managers.manager_id, managers.manager_name
FROM people
LEFT OUTER JOIN engineers ON people.person_id = engineers.engineer_id
LEFT OUTER JOIN managers ON people.person_id = managers.manager_id
What's the most minimal primary key we can determine from the
above join ? Rolling up all primary key columns is a little too
naive:
>>> print [str(c) for c in all_people.c if c.primary_key]
['people.person_id', 'engineers.engineer_id', 'managers.manager_id']
Because engineer_id and manager_id are always going to have
the same value as person_id (or be NULL). The difference here
determines whether we'd need to say query.get([2,None,2]) to
load person number 2, or just query.get([2]) (note that we
don't account for the idea that both engineers and managers
might point to the same person_id here, because we are
specifically talking about a "joined table inheritance" pattern
which does not support "multiple" inheritance).
The reduce_columns() function steps through and gives us just
what we want:
>>> from sqlalchemy.sql.util import reduce_columns
>>> print [str(c) for c in reduce_columns(
... [col for col in all_people.c if col.primary_key])]
['people.person_id']
reduce_columns() can also do its work given a set of criterion
or clauses to use:
>>> s = select([engineers, managers]).\
... where(engineers.c.engineer_name==managers.c.manager_name)
>>> print [str(c) for c in reduce_columns(list(s.c), s)]
['engineer_id', 'engineer_name', 'primary_language', 'manager_id']
Above, the manager_name column is removed since the WHERE
criterion of the select states that they're equivalent.
There's a bunch of other functions, but these three took the
longest time to get right (and clause adaption, which has
improved massively, still has some ambiguous cases that need to
be refined). So in future posts which talk about "adapting the
join" or "translating rows", now you know what I mean.