SQLAlchemy at PyGotham

September 16, 2011 at 12:01 PM | Code, SQLAlchemy, Talks

ORM Thoughts (in < 140 x 5 characters)

June 16, 2011 at 03:29 PM | Code, SQLAlchemy

When I wrote SQLAlchemy, a key intent was for it to be a working rebuttal to the "ORM is Vietnam" argument. (edit: for those who don't know where that phrase originates, it starts with this very famous post.)

This was achieved by: a. tweaking the key assumption that relational concepts must be hidden. They don't have to be. Practicality beats purity.

And b. not underestimating the task, i.e. unit of work, cached collections, and associations backed by eager loading (solving the N+1 problem).

Hibernate does both, but suffers from the limitations of Java - complex, heavy-handed. Python is what makes SQLAlchemy possible.

Our users can now write super-SQL-fluent apps succinctly and quickly; that's all the proof one should need that ORM is definitely worth it.


Magic, a "New" ORM

May 17, 2011 at 07:36 PM | Code, SQLAlchemy

TL;DR - Use SQLAlchemy to create your own Magic.

It's new, and easy! That's why we call it what it is: Magic. A new ORM that keeps things simple. Let's dive in !

from magic import (
            Entity, one_to_many, many_to_one, many_to_many, string
        )

I like this so far ! What does a new model class look like ?

class Parent(Entity):
    children = one_to_many("Child", "child_id",
                            reverse="parent")

No. Way. That's it ? No tables and columns ? No foreign thingies ? Where's the meaningless boilerplate ?

class Child(Entity):
    parent = many_to_one("Parent", "child_id",
                            reverse="children")

    tags = many_to_many("Tag", "child_tag",
                                "child_id",
                                "tag_id")

class Tag(Entity):
    name = string(50)

OK that's a little more chatty but seriously zzzeek, don't you want some weird "==" signs in there ?

Entity.setup_database("sqlite://", create=True)

This is beginning to remind me of washing machines that also have a dryer built inside of them, or those TVs that have VCRs embedded inside the case.

t1, t2, t3 = Tag(name='t1'), Tag(name='t2'), Tag(name='t3')
Entity.session.add(Parent(
        children={
            Child(tags={t1, t2}),
            Child(tags={t1, t3}),
            Child()
        }))
Entity.session.commit()

p1 = Entity.session.query(Parent).first()
for child in p1.children:
    print child, [t.name for t in child.tags]

New-style sets! Hooray for Python.

And...that's it. Magic!

Would you want this ORM ? Or would you want a different one ? Well how does Magic work ? I'm pretty sure you can guess how it starts:

from sqlalchemy import (
            Column, ForeignKey, Table,
            Integer, String, create_engine
        )

There's the zzzeek we know ! Blah blah blah tables, constraints, boring things. Well we might as well get on with it:

from sqlalchemy.orm import (
            class_mapper, mapper, relationship,
            scoped_session, sessionmaker, configure_mappers
        )
from sqlalchemy.ext.declarative import declared_attr, declarative_base
from sqlalchemy import event
import re

I like how "re" is the honored guest of all that SQLAlchemy stuff.

@event.listens_for(mapper, "mapper_configured")
def _setup_deferred_properties(mapper, class_):
    """Listen for finished mappers and apply DeferredProp
    configurations."""

    for key, value in class_.__dict__.items():
        if isinstance(value, DeferredProp):
            value._config(class_, key)

And here we have our first docstring. What is this "event" you speak of ?

zzzeek says: That's the new thing in 0.7 ! You're going to get a lot of mileage out of it - everything that used to be extension this, listener that, all goes through event. And there's lots of new events added with more on the way.

For this one in particular, just like it says, anytime a new mapper appears, this thing is going to run and....work all the magic.
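
(For reference, other 0.7 event hooks are registered the same way. Here's a minimal, hypothetical sketch of a listener on a mapped class - "SomeClass" stands in for any mapped class and isn't part of the Magic example:)

from sqlalchemy import event

# 'SomeClass' stands in for any mapped class
@event.listens_for(SomeClass, "before_insert")
def receive_before_insert(mapper, connection, target):
    # runs just before the INSERT for each instance of SomeClass
    print "flushing", target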

Well it was nice while it lasted, I guess it's about to get ugly huh.

Deep breath, just a slight pinch:

class DeferredProp(object):
    """A class attribute that generates a mapped attribute
    after mappers are configured."""

    def _setup_reverse(self, key, rel, target_cls):
        """Setup bidirectional behavior between two relationships."""

        reverse = self.kw.get('reverse')
        if reverse:
            reverse_attr = getattr(target_cls, reverse)
            if not isinstance(reverse_attr, DeferredProp):
                reverse_attr.property._add_reverse_property(key)
                rel._add_reverse_property(reverse)

class FKRelationship(DeferredProp):
    """Generates a one to many or many to one relationship."""

    def __init__(self, target, fk_col, **kw):
        self.target = target
        self.fk_col = fk_col
        self.kw = kw

    def _config(self, cls, key):
        """Create a Column with ForeignKey as well as a relationship()."""

        target_cls = cls._decl_class_registry[self.target]

        pk_target, fk_target = self._get_pk_fk(cls, target_cls)
        pk_table = pk_target.__table__
        pk_col = list(pk_table.primary_key)[0]

        if hasattr(fk_target, self.fk_col):
            fk_col = getattr(fk_target, self.fk_col)
        else:
            fk_col = Column(self.fk_col, pk_col.type, ForeignKey(pk_col))
            setattr(fk_target, self.fk_col, fk_col)

        rel = relationship(target_cls,
                primaryjoin=fk_col==pk_col,
                collection_class=self.kw.get('collection_class', set)
            )
        setattr(cls, key, rel)
        self._setup_reverse(key, rel, target_cls)

class one_to_many(FKRelationship):
    """Generates a one to many relationship."""

    def _get_pk_fk(self, cls, target_cls):
        return cls, target_cls

class many_to_one(FKRelationship):
    """Generates a many to one relationship."""

    def _get_pk_fk(self, cls, target_cls):
        return target_cls, cls

class many_to_many(DeferredProp):
    """Generates a many to many relationship."""

    def __init__(self, target, tablename, local, remote, **kw):
        self.target = target
        self.tablename = tablename
        self.local = local
        self.remote = remote
        self.kw = kw

    def _config(self, cls, key):
        """Create an association table between parent/target
        as well as a relationship()."""

        target_cls = cls._decl_class_registry[self.target]
        local_pk = list(cls.__table__.primary_key)[0]
        target_pk = list(target_cls.__table__.primary_key)[0]

        t = Table(
                self.tablename,
                cls.metadata,
                Column(self.local, ForeignKey(local_pk), primary_key=True),
                Column(self.remote, ForeignKey(target_pk), primary_key=True),
                keep_existing=True
            )
        rel = relationship(target_cls,
                secondary=t,
                collection_class=self.kw.get('collection_class', set)
            )
        setattr(cls, key, rel)
        self._setup_reverse(key, rel, target_cls)

That was highly unpleasant. Please don't paste that much code again.

zzzeek says: OK! It's just doing the foreign key and relationship() for us. If you've worked with straight SQLAlchemy before, most of what's in there shouldn't be too mysterious.

We're getting the "target" of the relationship using _decl_class_registry, a dictionary that Declarative maintains to map string names to classes. We look at the existing classes and their __table__ to get at the appropriate primary key (assumed here to be non-composite... a little more magic could certainly improve upon that!). We then create a Column() with a ForeignKey() the way you'd normally do for each of your mapped classes individually - or, in the many-to-many case, we put two of them into an association Table. Finally we send out a relationship() built from what we've come up with. We can stick these attributes right onto the classes, and Declarative takes care of making sure they get mapped.
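
To make that concrete, here's roughly what the Parent/Child portion of the Magic example boils down to in ordinary declarative SQLAlchemy (a hand-written sketch, not generated output; the Tag many-to-many is omitted):

from sqlalchemy import Column, ForeignKey, Integer
from sqlalchemy.orm import relationship
from sqlalchemy.ext.declarative import declarative_base

PlainBase = declarative_base()

class Parent(PlainBase):
    __tablename__ = 'parent'

    id = Column(Integer, primary_key=True)
    children = relationship("Child", collection_class=set, backref="parent")

class Child(PlainBase):
    __tablename__ = 'child'

    id = Column(Integer, primary_key=True)
    child_id = Column(Integer, ForeignKey('parent.id'))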

It's an entirely alternate form of relationship() in just 80 lines - there are lots of ways to play with patterns like this. I personally don't need this much re-working of SQLAlchemy's usual relationship() syntax, and I think most of our users don't either - the first and foremost job of relationship() is to have awesome functionality. But I do see requests to make it do things like this, and one of our goals is to make whatever customizations people need as doable as possible. Patterns like these can change how the rest of your project looks. That one is pretty ambitious - but there are plenty of others that are a lot simpler, and they can really cut down on noise throughout the bulk of mapping code:

def string(size):
    """Convenience macro, return a Column with String."""

    return Column(String(size))

def int():
    """Convenience macro, return a Column with Integer."""

    return Column(Integer)

Why thank you !

class Base(object):
    """Base class which auto-generates tablename, surrogate
    primary key column.

    Also includes a scoped session and a database generator.

    """
    @declared_attr
    def __tablename__(cls):
        """Convert CamelCase class name to underscores_between_words
        table name."""
        name = cls.__name__
        return (
            name[0].lower() +
            re.sub(r'([A-Z])', lambda m:"_" + m.group(0).lower(), name[1:])
        )

    id = Column(Integer, primary_key=True)
    """Surrogate 'id' primary key column."""

    @classmethod
    def setup_database(cls, url, create=False, echo=False):
        """'Setup everything' method for the ultra lazy."""

        configure_mappers()
        e = create_engine(url, echo=echo)
        if create:
            cls.metadata.create_all(e)
        cls.session = scoped_session(sessionmaker(e))

Entity = declarative_base(cls=Base)

Well now ! Why didn't you tell us you could do that before ? I've been putting __tablename__ and columns all over the place.

zzzeek says: We get into this to a good degree when we talk about "mixins" here; most of what mixins do can go on your "base" as well.
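
(A minimal sketch of the same idea expressed as a mixin rather than a base - the names here are illustrative and not part of the Magic example:)

from sqlalchemy import Column, Integer, String
from sqlalchemy.ext.declarative import declared_attr, declarative_base

class TablenameAndIdMixin(object):
    """Provide __tablename__ and a surrogate 'id' column to any class."""

    @declared_attr
    def __tablename__(cls):
        return cls.__name__.lower()

    id = Column(Integer, primary_key=True)

ExampleBase = declarative_base()

class Widget(TablenameAndIdMixin, ExampleBase):
    name = Column(String(50))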

Alrighty. Short blog post today?

zzzeek says: Indeed. The moral of the story is, SQLAlchemy isn't a framework, and never was...it's a toolkit - you should build things !

Look for SQLAlchemy 0.7's production release soon; in the meantime, here's some magic.


The Enum Recipe

January 14, 2011 at 12:46 PM | Code, SQLAlchemy

In the most general sense an enumeration is an exact listing of all the elements of a set. In software design, enums are typically sets of fixed string values that define some kind of discriminating value within an application. In contrast to a generic "dropdown" list, such as a selection of timezones, country names, or years in a date picker, the enum usually refers to values that are also explicit within the application's source code, such as "debit" or "credit" in an accounting application, "draft" or "publish" in a CMS, "everyone", "friends of friends", or "friends only" in your typical social media sell-your-details-to-the-highest-bidder system. Differing values have a direct impact on business logic. Adding new values to the list usually corresponds with the addition of some new logic in the application to accommodate its meaning.

The requirements for an application-level enumeration are usually:

  1. Can represent a single value within application logic with no chance of specifying a non-existent value (i.e., we don't want to hardcode strings or numbers).
  2. Can associate each value with a textual description suitable for a user interface.
  3. Can get the list of all possible values, usually for user interface display.
  4. Can efficiently associate the discriminating value with many database records.

Representing an enumerated value in a relational database often goes like this:

CREATE TABLE employee_type (
    id INTEGER PRIMARY KEY,
    description VARCHAR(30) NOT NULL
);

CREATE TABLE employee (
    id INTEGER PRIMARY KEY,
    name VARCHAR(60) NOT NULL,
    type INTEGER REFERENCES employee_type(id)
);

INSERT INTO employee_type (id, description) VALUES
    (1, 'Part Time'),
    (2, 'Full Time'),
    (3, 'Contractor');

Above we use the example of a database of employees and their status. Advantages of the above include:

  1. The choice of "employee type" is constrained.
  2. The textual descriptions of employee types are associated with the constrained values.
  3. New employee types can be added just by adding a new row.
  4. Queries that produce textual displays of discriminator values can be written directly against the data, without leaving the database console.

But as we all know this approach also has disadvantages:

  1. It's difficult to avoid hardcoding integer IDs in our application. Adding a character-based "code" field to the employee_type table, even making the character field the primary key, can ameliorate this, but this is not information that would otherwise be needed in the database. Our DBAs also got grumpy when we proposed a character-based primary key.
  2. To display choices in dropdowns, as well as to display the textual description of the value associated with a particular piece of data, we need to query the database for the text - either by loading them into an in-memory lookup ahead of time, or by joining to the lookup table when we query the base table. This adds noise and boilerplate to the application.
  3. For each new data-driven enumerative type used by the application, we need to add a new table and populate it.
  4. When the descriptive names change, we have to update the database, tying database migration work to what would normally be a user-interface-only update.
  5. Whatever framework we build around these lookup tables doesn't really work for enumerations that don't otherwise need to be persisted.
  6. If we moved to a non-relational database, we'd probably do this completely differently.

Basically, this approach is tedious and puts information about the enum further away from the application code than we'd prefer.

An alternative to the lookup table is to use a database-supplied enumeration. Both MySQL and PostgreSQL (as of 8.3) offer an ENUM type for this purpose. It's also fairly straightforward to approximate an ENUM datatype in most databases by using a CHAR column in conjunction with a CHECK constraint that tests incoming values against a set of possible values.

SQLAlchemy provides an Enum type which abstracts this technique:

from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, String, Enum
Base = declarative_base()

class Employee(Base):
    __tablename__ = 'employee'

    id = Column(Integer, primary_key=True)
    name = Column(String(60), nullable=False)
    type = Column(Enum('part_time', 'full_time', 'contractor', name='employee_types'))

On backends that support ENUM, metadata.create_all() emits the appropriate DDL to generate the type. The 'name' field of the Enum is used as the name of the type created in PG:

CREATE TYPE employee_types AS ENUM ('part_time','full_time','contractor')

CREATE TABLE employee (
    id SERIAL NOT NULL,
    name VARCHAR(60) NOT NULL,
    type employee_types,
    PRIMARY KEY (id)
)

On those that don't, it emits a VARCHAR datatype and additionally emits DDL to generate an appropriate CHECK constraint. Here, the 'name' field is used as the name of the constraint:

CREATE TABLE employee (
    id INTEGER NOT NULL,
    name VARCHAR(60) NOT NULL,
    type VARCHAR(10),
    PRIMARY KEY (id),
    CONSTRAINT employee_types CHECK (type IN ('part_time', 'full_time', 'contractor'))
)

In the case of PG's native ENUM, we're using the same space as a regular integer (four bytes on PG). In the case of CHAR/VARCHAR, keeping the size of the symbols down to one or two characters should keep the size under four bytes (database-specific overhead and encoding concerns may vary results).

To combine the ENUM database type with the other requirements of source-code level identification and descriptive naming, we'll encapsulate the whole thing into a base class that can be used to generate all kinds of enums:

class EnumSymbol(object):
    """Define a fixed symbol tied to a parent class."""

    def __init__(self, cls_, name, value, description):
        self.cls_ = cls_
        self.name = name
        self.value = value
        self.description = description

    def __reduce__(self):
        """Allow unpickling to return the symbol
        linked to the DeclEnum class."""
        return getattr, (self.cls_, self.name)

    def __iter__(self):
        return iter([self.value, self.description])

    def __repr__(self):
        return "<%s>" % self.name

class EnumMeta(type):
    """Generate new DeclEnum classes."""

    def __init__(cls, classname, bases, dict_):
        cls._reg = reg = cls._reg.copy()
        for k, v in dict_.items():
            if isinstance(v, tuple):
                sym = reg[v[0]] = EnumSymbol(cls, k, *v)
                setattr(cls, k, sym)
        return type.__init__(cls, classname, bases, dict_)

    def __iter__(cls):
        return iter(cls._reg.values())

class DeclEnum(object):
    """Declarative enumeration."""

    __metaclass__ = EnumMeta
    _reg = {}

    @classmethod
    def from_string(cls, value):
        try:
            return cls._reg[value]
        except KeyError:
            raise ValueError(
                    "Invalid value for %r: %r" %
                    (cls.__name__, value)
                )

    @classmethod
    def values(cls):
        return cls._reg.keys()

Above, DeclEnum is the public interface. There's a bit of fancy pants stuff in there, but here's what it looks like in usage. We build an EmployeeType class, as a subclass of DeclEnum, that has all the things we want at once, with zero of anything else:

class EmployeeType(DeclEnum):
    part_time = "part_time", "Part Time"
    full_time = "full_time", "Full Time"
    contractor = "contractor", "Contractor"

If we're trying to save space on a non-ENUM platform, we might use single character values:

class EmployeeType(DeclEnum):
    part_time = "P", "Part Time"
    full_time = "F", "Full Time"
    contractor = "C", "Contractor"

Our application references individual values using the class level symbols:

employee = Employee(name=name, type=EmployeeType.part_time)
# ...
if employee.type is EmployeeType.part_time:
    # do something with part time employee

These symbols are global constants, hashable, and even pickleable, thanks to the special __reduce__ above.
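
For instance, because __reduce__ hands back a getattr recipe, a pickled symbol comes back as the very same object (a quick sketch, assuming the EmployeeType class is importable at module level):

import pickle

assert pickle.loads(pickle.dumps(EmployeeType.part_time)) is EmployeeType.part_time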

To get at value/description pairs for a dropdown, we can iterate the class as well as the symbols themselves to get 2-tuples:

>>> for key, description in EmployeeType:
...    print key, description
P Part Time
F Full Time
C Contractor

To convert from a string value, as passed to us in a web request, to an EmployeeType symbol, we use from_string():

type = EmployeeType.from_string('P')

The textual description is always available directly from the symbol itself:

print EmployeeType.contractor.description

So we have application level constants, textual descriptions, and iteration. The last step is persistence. We'll use SQLAlchemy's TypeDecorator to augment the Enum() type such that it can read and write our custom values:

from sqlalchemy.types import SchemaType, TypeDecorator, Enum
import re

class DeclEnumType(SchemaType, TypeDecorator):
    def __init__(self, enum):
        self.enum = enum
        self.impl = Enum(
                        *enum.values(),
                        name="ck%s" % re.sub(
                                    '([A-Z])',
                                    lambda m:"_" + m.group(1).lower(),
                                    enum.__name__)
                    )

    def _set_table(self, table, column):
        self.impl._set_table(table, column)

    def copy(self):
        return DeclEnumType(self.enum)

    def process_bind_param(self, value, dialect):
        if value is None:
            return None
        return value.value

    def process_result_value(self, value, dialect):
        if value is None:
            return None
        return self.enum.from_string(value.strip())

The idea of TypeDecorator, for those who haven't worked with it, is to wrap a plain database type with additional marshaling behavior, above what we need just to get consistency from the DBAPI. The impl data member refers to the type being wrapped. In this case, DeclEnumType generates a new Enum object using information from a given DeclEnum subclass. The name of the enum is derived from the name of our class, using the world's shortest camel-case-to-underscore converter.
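
If TypeDecorator is new to you, here's a minimal, hypothetical example unrelated to the enum recipe, just to show where process_bind_param() and process_result_value() fit in:

from sqlalchemy.types import TypeDecorator, String

class LowerCaseString(TypeDecorator):
    """Store strings lower-cased regardless of how they're passed in."""

    impl = String

    def process_bind_param(self, value, dialect):
        # applied to values on their way into an INSERT/UPDATE or WHERE clause
        return value.lower() if value is not None else None

    def process_result_value(self, value, dialect):
        # applied to values coming back in result rows; here a pass-through
        return value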

The addition of SchemaType as well as the _set_table() method represent a little bit of inside knowledge about the sqlalchemy.types module. TypeDecorator currently does not automatically figure out from its impl that it needs to export additional functionality related to the generation of the CHECK constraint and/or the CREATE TYPE. SQLAlchemy will try to improve upon this at some point.

We can nicely wrap the creation of DeclEnumType into our DeclEnum via a new class method:

class DeclEnum(object):
    """Declarative enumeration."""

    # ...

    @classmethod
    def db_type(cls):
        return DeclEnumType(cls)

So the full declaration and usage of our type looks like:

from sqlalchemy import Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base

class EmployeeType(DeclEnum):
    part_time = "P", "Part Time"
    full_time = "F", "Full Time"
    contractor = "C", "Contractor"

Base = declarative_base()

class Employee(Base):
    __tablename__ = 'employee'

    id = Column(Integer, primary_key=True)
    name = Column(String(60), nullable=False)
    type = Column(EmployeeType.db_type())

Our Employee class will persist its 'type' field into a new ENUM on the database side, and on the Python side we use exclusively EmployeeType.part_time, EmployeeType.full_time, EmployeeType.contractor as values for the 'type' attribute.
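
As a quick end-to-end sketch (using a hypothetical in-memory SQLite database), the round trip looks like this:

from sqlalchemy import create_engine
from sqlalchemy.orm import Session

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

session = Session(engine)
session.add(Employee(name="mary", type=EmployeeType.full_time))
session.commit()

emp = session.query(Employee).first()
assert emp.type is EmployeeType.full_time
print emp.type.description  # prints "Full Time"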

The enum is also ideal for so-called polymorphic-discriminators, where different values indicate the usage of different subclasses of Employee:

class Employee(Base):
    __tablename__ = 'employee'

    id = Column(Integer, primary_key=True)
    name = Column(String(60), nullable=False)
    type = Column(EmployeeType.db_type())
    __mapper_args__ = {'polymorphic_on':type}

class PartTimeEmployee(Employee):
    __mapper_args__ = {'polymorphic_identity':EmployeeType.part_time}

TypeDecorator also takes care of coercing Python values used in expressions into the appropriate SQLAlchemy type, so that the constants are usable in queries:

session.query(Employee).filter_by(type=EmployeeType.contractor).all()

A runnable demo of the enumeration recipe is packaged up in decl_enum.py.


A Tale of Three Profiles

December 12, 2010 at 03:36 PM | Code, SQLAlchemy

(tl;dr - scroll down and look at the pretty pictures !)

Object relational mappers give us a way to automate the translation of domain model concepts into SQL and relational database concepts. The SQLAlchemy ORM's level of automation is fairly high, in that it tracks changes in state along a domain model to determine the appropriate set of data to be persisted and when it should occur, synchronizes changes in state from the database back to the domain model along transaction boundaries, and handles the persistence and in-memory restoration of entity relationships and collections, represented as inter-table and inter-row dependencies on the database side. It also can interpret domain-specific concepts into SQL queries that take great advantage of the relational nature of the backend (which in non-hand-wavy speak generally means, SQLAlchemy is fairly sophisticated in the realm of generating joins, subqueries, and combinations thereof).

These are complex, time consuming tasks, especially in an interpreted language like Python. But we use them for the advantage that we can work with fully realized relational models, mapped to domain models that match our way of looking at the problem, with little to no persistence-specific code required outside of configuration. We can optimize the behavior of persistence and loading with little to no changes needed to business or application logic. We save potentially dozens of lines of semi-boilerplate code that would otherwise be needed within each use case to handle the details of loading data, optimizing the loads as needed, marshalling data to and from the domain model, and persisting changes in state from domain to database in the correct order. When we don't need to code persistence details in each use case, there's less code overall, as well as a lower burden of coverage, testing, and maintenance.

The functionality of the ORM is above and beyond what we gain from a so-called "data abstraction layer". The data abstraction layer provides a system of representing SQL statements, bound values, statement execution, and result sets in a way that is agnostic of database backend as well as DBAPI. A data abstraction layer may also automate supplementary tasks such as dealing with primary key generation and sequence usage, generation and introspection of database schemas, and management of features that vary across backends, such as savepoints, two-phase transactions, and complex datatypes.

The data abstraction layer is of course critical, as an ORM cannot exist without one. SQLAlchemy's own data abstraction system forms the basis of the object relational mapper. But the distinction between the two is so important that we have always kept the two systems completely separate - in our current documentation the systems are known as SQLAlchemy Core and SQLAlchemy ORM.

Keeping these systems explicitly separate helps the developer be aware of a conscious choice - which is that he or she is deciding that the higher level, but decidedly more expensive automation of the ORM layer is worth it. It's always an option to drop down into direct activity with the data abstraction layer, where the work of deciding what SQL statements to create, how they are constructed, when to execute them, and what to do with their results are up to the developer, but the in-Python overhead is fixed and relatively small. Lots of developers forego the ORM entirely for either performance or preferential reasons, and some developers build their own simple object layers on top of the data abstraction layer.
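
Dropping down to that layer looks something like this - a minimal sketch, assuming an already-defined Table named employee_table and an Engine named engine:

from sqlalchemy import select

stmt = select([employee_table.c.name]).where(employee_table.c.type == 'grunt')
for row in engine.execute(stmt):
    print row.name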

In my own case, my current project involves finance and the ability to persist sets of data across two dozen tables on the order of 500K rows per dataset (where the "dataset" is a set of analytics and historical information generated daily), loading them back later to generate reports, with heavy emphasis on Decimal formatting and calculations. As my project managers suddenly started handing me test files that consisted not of the few hundred records we tested with, but more on the order of 20000-30000 records (where each record in turn requires about 20 rows, hence 500K), rewriting some of the ORM code to use the data abstraction layer directly in those places where we need to blaze through all the records seemed like it might be necessary. But each component that's rewritten to use the SQL Expression Language directly would grow in size as explicit persistence logic is added, requiring more tests to ensure correct database interaction, reducing reusability, and increasing maintenance load. It would also mean large chunks of existing domain logic, consisting of business rules and calculations that are shared among many components, would have to be refactored in some way to support the receipt and mutation of simple named-tuple result rows (or something similarly rudimentary, else overhead goes right up again), as well as continuing to work as they do now with rich domain objects. It would most likely add a large chunk of nuts-and-bolts code to what is already a large application with lots of source files, source that is trying very hard to stick to business rules and stay away from tinkering.

We'd like to delay having to rewrite intricate sections of batch loading and reporting code into hand-tailored SQL operations for as long as possible. This is a new application still under development; if certain components have been running for months or years without any real changes, the burden of rewriting them as inlined SQL is less than when it is during early development, when rules are changing almost daily and the shape of the data is still a moving target.

So to buy us some time, SQLAlchemy 0.7, like all major releases, gets another boost of performance tuning. This tuning is always the same thing - use profiling to identify high-percentage areas for certain operations, remove any obvious inefficiencies within method bodies, then find ways to reduce callcounts and inline functionality into fewer calls. It's very rare these days for there to be any large bottlenecks with "obvious" solutions - we've been cutting down on overhead for years while maintaining our fairly extensive behavioral contract. The overhead comes from just having a large number of pieces that coordinate to execute operations. Each piece, in most cases, is small and gets its work done using the best Python idioms money can buy. It's the sheer number of components, combined with the plainly crushing overhead each function call in Python produces, that add up to slowness. If we do things the way a tool like Hibernate does them, that is using deep call stacks that abstract each decision into its own method, each of those decisions in turn stowing each of its sub-decisions into more methods, building a huge chain of polymorphic nesting for each column written or read, we'd grind to a halt (note that this is fine for Hibernate since Java has unbelievably fast method invocation).

Enter RunSnakeRun

This time around I dove right into a tool that's new to me, which one of our users mentioned on the mailing list: a graphical tool called RunSnakeRun. Using it this weekend I was able to identify some more areas worth optimizing, as well as to get better insight into the best way to optimize them. In addition to the usual dozens of small inlinings of Python code and cleanups of small crufty sections, I identified and caught two big fish with this one:

  • The ongoing work on our unit of work code has finally made it possible to batch INSERT statements together into executemany() calls without unreasonable complexity, provided certain conditions are met (see the short Core-level sketch after this list). This was revealed after noticing psycopg2's own execute() call taking up nearly 40% of the flush process for certain large insert operations - operations where we didn't need to fetch back a newly generated primary key, which is usually the reason we can't use executemany().
  • The fetch of an unloaded many-to-one relationship from the identity map was incurring much larger overhead than expected; this is because the overhead of building up a Query object, just so that it could do a get() without emitting any SQL, becomes significant after fetching many thousands of objects.
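
Here's the general idea behind the first item, at the Core level (a sketch with hypothetical table and engine names, not the profiled test itself) - passing a list of parameter dictionaries to execute() lets the DBAPI receive a single executemany() call:

conn = engine.connect()
conn.execute(
    grunt_table.insert(),
    [{"name": "Grunt %d" % i, "savings": 50000 + i} for i in xrange(1000)]
)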

For a little view into what RunSnakeRun can provide, we'll illustrate what it shows us for three major versions of SQLAlchemy, given the profiling output of a small test program that exercises a fairly ordinary model and series of steps against 11,000 objects. Secretly, these particular steps have been carefully tailored to highlight the improvements to maximum effect.
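
(For those who haven't used it, RunSnakeRun reads standard cProfile output files; a typical way to produce one, with hypothetical file and function names, is something like:)

import cProfile

cProfile.run("run_test()", "sqla_test.profile")  # 'run_test' is the test's entry point
# then, from a shell:  runsnake sqla_test.profile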

The code below should be familiar to anyone who uses the SQLAlchemy ORM regularly (and for those who don't, we have a great tutorial!). The model is innocuous enough, using joined table inheritance to represent two classes of Employee in three tables. Here, a Grunt references a Boss via many-to-one:

from sqlalchemy import Column, Integer, create_engine, ForeignKey, \
                    String, Numeric
from sqlalchemy.orm import relationship, Session
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Employee(Base):
    __tablename__ = 'employee'

    id = Column(Integer, primary_key=True)
    name = Column(String(100), nullable=False)
    type = Column(String(50), nullable=False)

    __mapper_args__ = {'polymorphic_on':type}

class Boss(Employee):
    __tablename__ = 'boss'

    id = Column(Integer, ForeignKey('employee.id'), primary_key=True)
    golf_average = Column(Numeric)

    __mapper_args__ = {'polymorphic_identity':'boss'}

class Grunt(Employee):
    __tablename__ = 'grunt'

    id = Column(Integer, ForeignKey('employee.id'), primary_key=True)
    savings = Column(Numeric)

    employer_id = Column(Integer, ForeignKey('boss.id'))
    employer = relationship("Boss", backref="employees",
                                    primaryjoin=Boss.id==employer_id)

    __mapper_args__ = {'polymorphic_identity':'grunt'}

We create 1000 Boss objects and 10000 Grunt objects, linking them together in such a way that results in the "batching" of flushes, each of 100 Grunt objects. This is much like a real world bulk operation where the data being generated is derived from data already present in the database, so installing 10000 objects involves a continuous stream of SELECTs and INSERTs, rather than just a big mass execute of 20000 rows:

from decimal import Decimal
import random

# 'engine' refers to an Engine configured earlier in the test script
sess = Session(engine)

# create 1000 Boss objects.
bosses = [
    Boss(
        name="Boss %d" % i,
        golf_average=Decimal(random.randint(40, 150))
    )
    for i in xrange(1000)
]

sess.add_all(bosses)

# create 10000 Grunt objects.
grunts = [
    Grunt(
        name="Grunt %d" % i,
        savings=Decimal(random.randint(5000000, 15000000) / 100)
    )
    for i in xrange(10000)
]

# associate grunts with bosses, persist 100 at a time
while grunts:
    boss = sess.query(Boss).\
                filter_by(name="Boss %d" % (101 - len(grunts) / 100)).\
                first()
    for grunt in grunts[0:100]:
        grunt.employer = boss

    grunts = grunts[100:]

sess.commit()

We'll illustrate loading back data on our grunts as well as their bosses into a "report":

report = []

for grunt in sess.query(Grunt):
    report.append((
                    grunt.name,
                    grunt.savings,
                    grunt.employer.name,
                    grunt.employer.golf_average
                ))

The above model is more intricate than it might appear, due to the double dependency the grunt table has on both the employee and boss tables, the dependency of boss on employee, and the in-Python dependency of the Employee class on itself in the case of Grunt -> Boss.

With RunSnakeRun, the main part of the display shows us a graphical view of method calls as boxes, sized roughly according to their proportion of the operation. The boxes in turn contain sub-boxes representing their callees. A method that is called by more than one caller appears in multiple locations. With this system, our picture-oriented brains are led by the nose to the "big shiny colors!" that represent where we need to point our editors and re-evaluate how something is doing what it does.

SQLAlchemy 0.5

SQLAlchemy 0.5.8 was a vast improvement both usage- and performance- wise over its less mature predecessors. I would say that it represented the beginning of the third "era" of SQLAlchemy, the first being "hey look at this new idea!", the second being "crap, we have to make this thing work for real!". I'd characterize the third, current era as "open for business". Here, RunSnakeRun shows us the most intricate graph of all three, corresponding to the fact that SQLA 0.5, for all its huge improvements over 0.4, still requires a large number of prominent "things to do" in order to get the job done (where each box that's roughly 1/10th the size of the whole image is a particular bucket of "some major thing we do"):

[Profile graph: SQLAlchemy 0.5.8]

  • Total calls: 10,556,480
  • Total cpu seconds: 13.79
  • Total execute/executemany calls: 22,201

SQLAlchemy 0.6

SQLAlchemy 0.6 featured lots more performance improvements, and most notably a total rewrite of the unit of work, which previously was the biggest dinosaur still lying around from the early days. Here we see a 30% improvement in method overhead. There are fewer "big" boxes, and each box has fewer boxes inside it. This represents less complexity of operation to accomplish the same thing:

[Profile graph: SQLAlchemy 0.6.6]

  • Total calls: 7,963,214
  • Total cpu seconds: 10.50
  • Total execute/executemany calls: 22,201

SQLAlchemy 0.7

Finally, with 0.7, the fruits of many more performance improvements within the ORM and SQL execution/expression layers, including the two big ones discussed above, show up as even bigger fields with fewer, larger boxes inside each one. Function call overhead here (updated 12/22/2010) is around 50% less than 0.6 and 62% less than 0.5, and we also see a 50% reduction in calls to the DBAPI's execute() and executemany() methods, using the DBAPI's native ability to emit large numbers of statements efficiently via executemany(). Perhaps I've been staring at colored boxes all weekend for too long, but the difference between 0.5, 0.6, and 0.7 seems to me pretty dramatic:

[Profile graph: SQLAlchemy 0.7.0 (updated 12/22/2010)]

  • Total calls: 3,984,550
  • Total cpu seconds: 5.99
  • Total execute/executemany calls: 11,302

So I'd like to give props to RunSnakeRun for giving SQLAlchemy's profiling a good shot in the arm. The continuous refactoring and refinements to SQLA are an ongoing process. More is to come within SQLAlchemy 0.7, which is nearly ready for beta releases. Look for larger and fewer colored boxes as we move to 0.8, 0.9, and...whatever comes after 0.9.

The source code for the test includes more specifics and annotations, as well as notes on changes across SQLAlchemy versions for this particular kind of model.