Using Beaker for Caching? Why You'll Want to Switch to dogpile.cache
April 19, 2012 at 12:01 PM | Code | View CommentsContinuing on where I left off regarding Beaker in October (see Thoughts on Beaker), my new replacement for Beaker caching, dogpile.cache, has had a bunch of early releases. While I'm still considering it "alpha" until I know a few people have taken it around the block, it should be pretty much set for early testing and hopefully can be tagged as production quality in the near future.
The core of Beaker's caching mechanism is based on code I first wrote in 2005. It was adapted from what was basically my first Python program ever, a web template engine called Myghty, which in turn was based on a Perl system called HTML::Mason. The caching scenarios Beaker was designed for were primarily that of storing data in files, such as DBM files. A key assumption made at that time was that the backends would all provide some system of returning a flag whether or not a key was present, which would precede the actual fetch of the value from the cache. Another assumption made was that the actual lock applied to these backends to deal with the dogpile situation would be at its most "distributed" scope a file-based lock, using flock().
When memcached support was added to Beaker, these assumptions proved to be architectural shortcomings. There is no "check for a key" function in memcached; there's only get(). Beaker's dogpile lock calls "check for a key" twice. As a result, Beaker will in general pull a value out of memcached three times, each time resulting in an "unpickle" of a pickled object. The upshot of this is that Beaker pulls over the network and unpickles your object three times times on every cache hit. Users of Beaker are also well familiar with the awkward lock files Beaker insists on generating, even though there are more appropriate ways to lock for distributed caches.
So for no other reason than these, dogpile.cache's entirely new and extremely simplified architecture is an improvement of vast proportions. The test program below illustrates the improvement in unpickling behavior, as well as dogpile.cache's simplified API:
class Widget(object): """Sample object to be cached. Counts pickles and unpickles. """ pickles = 0 unpickles = 0 def __init__(self, id): self.id = id def __getstate__(self): Widget.pickles +=1 return self.__dict__ def __setstate__(self, state): Widget.unpickles +=1 self.__dict__.update(state) def test_beaker(): from beaker import cache cache_manager = cache.CacheManager(cache_regions={ 'default' :{ 'type':'memcached', 'url':'127.0.0.1:11211', 'expiretime':1, 'lock_dir':'.', 'key_length':250 } }) @cache_manager.region("default", "some_key") def get_widget_beaker(id): return Widget(id) _run_test(get_widget_beaker) def test_dogpile(): from dogpile.cache import make_region from dogpile.cache.util import sha1_mangle_key region = make_region(key_mangler=sha1_mangle_key).configure( 'dogpile.cache.memcached', expiration_time = 1, arguments = { 'url':["127.0.0.1:11211"], }, ) @region.cache_on_arguments() def get_widget_dogpile(id): return Widget(id) _run_test(get_widget_dogpile) def _run_test(get_widget): """Store an object, retrieve from the cache. Wait two seconds, then exercise a regeneration. """ import time Widget.pickles = Widget.unpickles = 0 # create and cache a widget. # no unpickle necessary. w1 = get_widget(2) # get it again. one pull from cache # equals one unpickle needed. w1 = get_widget(2) time.sleep(2) # get from cache, will pull out the # object but also the fact that it's # expired (costs one unpickle). # newly generated object # cached and returned. w1 = get_widget(2) print "Total pickles:", Widget.pickles print "Total unpickles:", Widget.unpickles print "beaker" test_beaker() print "dogpile" test_dogpile()
Running this with a clean memcached you get:
beaker Total pickles: 2 Total unpickles: 6 dogpile Total pickles: 2 Total unpickles: 2
Run it a second time, so that the Widget is already in the cache. Now you get ten unpickles with Beaker compared to dogpile.cache's three:
beaker Total pickles: 2 Total unpickles: 10 dogpile Total pickles: 2 Total unpickles: 3
The advantages of dogpile.cache go way beyond that:
- dogpile.cache includes distinct memcached backends for pylibmc, memcache and bmemcached. These are all explicitly available via different backend names, in contrast to Beaker's approach of deciding for you which memcached backend it wants to use.
- A dedicated API-space for backend-specific arguments, such as all the special arguments pylibmc offers.
- A Redis backend is provided.
- The system of "dogpile locking" is completely modular, and in the case of memcached and Redis, a "distributed lock" option is provided which will use the "set key if not exists" feature of those backends to provide the dogpile lock. A plain threaded mutex can be specified also.
- Cache regions and function decorators are open ended. You can plug in your own system of generating cache keys from decorated functions, as well as what kind of "key mangling" you'd like to apply to keys going into the cache (such as encoding, hashing, etc.)
- No lockfiles whatsoever unless you use the provided DBM backend; and there, you tell it exactly where to put the lockfile, or tell it to use a regular mutex instead.
- New backends are ridiculously simple to write, and can be popped in using regular setuptools entry points or in-application using the register_backend() function.
- Vastly simplified scope - there's no dilution of the task at hand with session, cookie, or encryption features.
- Python 3 compatible in-place with no 2to3 step needed.
So I'm hoping we can all soon get modernized onto dogpile.cache.
Pycon 2012 : Hand Coded Applications with SQLAlchemy
March 12, 2012 at 12:01 PM | SQLAlchemy, Code | View CommentsHere's the slides from my Pycon 2012 talk, "Hand Coded Applications with SQLAlchemy". I had a great time with this talk and thanks all for coming !
Update: Here's the video!
When I first created SQLAlchemy, I knew I wanted to create something significant. It was by no means the first ORM or database abstraction layer I'd written; by 2005, I'd probably written about a dozen abstraction layers in several languages, including in Java, Perl, C and C++ (really bad C and even worse C++, one that talked to ODBC and another that communicated with Microsoft's ancient DB-LIB directly). All of these abstraction layers were in the range of awful to mediocre, and certainly none were anywhere near release-quality; even by late-90's to early-2000's standards. They were all created for closed-source applications written on the job, but each one did its job very well.
It was the repetitive creation of the same patterns over and over again that made apparent the kinds of things a real toolkit should have, as well as increased the urge to actually go through with it, so that I wouldn't have to invent new database interaction layers for every new project, or worse, be compelled by management to use whatever mediocre product they had read about the week before (keeping in mind I was made to use such disasters as EJB 1.0). But at the same time it was apparent to me that I was going to need to do some research up front as well. The primary book I used for this research was Patterns of Enterprise Archictecture by Martin Fowler. When reading this book, about half the patterns were ones that I'd already used implicitly, and the other half were ones that I was previously not entirely aware of.
Sometimes I read comments from new users expressing confusion or frustration with SQLAlchemy's concepts. Maybe some of these users are not only new to SQLAlchemy but are new to database abstraction layers in general, and some maybe even to relational databases themselves. What I'd like to lay out here is just how many of POEAA's patterns SQLAlchemy is built upon. If you're new to SQLAlchemy, my hope is that this list might help to de-mystify where these patterns come from.
These links are from Catalog of Patterns of Enterprise Architecture.
- Data Mapper - The key to this pattern is that object-relational mapping is applied to a user-defined class in a transparent way, keeping the details of persistence separate from the public interface of the class. SQLAlchemy's classical mapping system, which is the usage of the mapper() function to link a class with table metadata, implemented this pattern as fully as possible. In modern SQLAlchemy, we use the Declarative pattern which combines table metadata with the class' declaration as a shortcut to using mapper(), but the persistence API remains separate.
- Unit of Work - This pattern is where the system transparently keeps track of changes to objects and periodically flushes all those pending changes out to the database. SQLAlchemy's Session implements this pattern fully in a manner similar to that of Hibernate.
- Identity Map - This is an essential pattern that establishes unique identities for each object within a particular session, based on database identity. No ORM should be without this feature, as working with object structures and applications of the most moderate complexity is vastly simplified and made more efficient with this pattern in place.
- Metadata Mapping - this chapter in the book is where the name MetaData comes from. The exact correspondence to Fowler's pattern would be the combination of mapper() and Table.
- Query Object - Both the ORM Query and the Core select() construct are built on this pattern.
- Repository - An interface that serves as the gateway to the database, in terms of object-relational mappings. This is the SQLAlchemy Session.
- Lazy Load - Load a related collection or object as you need it. SQLAlchemy, like Hibernate, has a lot of options in how attributes can load things.
- Identity Field - Represent the primary key of a table's row within the object that represents it.
- Foreign Key Mapping - Database foreign keys are represented using relationships in the object model.
- Association Table Mapping - A class can be mapped that represents information about how two objects are related to each other. Use the Association Object for this pattern.
- Embedded Value - a value inline on an object represents multiple columns. SQLAlchemy provides the Composite pattern here.
- Serialized LOB - Sometimes you just want to stuff all the objects into a BLOB. Use the PickleType or roll a JSON type.
- Inheritance Mappers - Represent class hierarchies within database tables. See Inheritance Mapping.
- Optimistic Offline Lock - Set up a version id on your mapping to enable this feature in SQLAlchemy.
Thanks for reading!
Django-style Database Routers in SQLAlchemy
January 11, 2012 at 06:50 PM | SQLAlchemy, Code | View CommentsAs luck would have it, I'm currently assigned to work with some existing Django code. It's been quite a long time I've gone without needing to do so, but now that I'm there, I can start fleshing out how some of the use cases for Django can be approximated in SQLAlchemy.
Today it's something that's really quite simple - how to mimic the Multiple Databases example in Django's docs. The Django approach is to build a "router" object into the system which can then decide on a query-by-query basis which of several databases should be used for each query. For this, they provide an interface which includes db_for_read(), db_for_write(), allow_relation() and allow_syncdb(). In particular, db_for_read() and db_for_write() are used when emitting most SQL. These receive an argument called model, which the docs state is a "type", i.e. a class. Because the database determination is based on type, we call this in SQLAlchemy vertical partitioning - databases are partitioned along types.
The specific example here has the following behavior:
- Write operations occur on a database referred to as master.
- Read operations are split among two different databases referred to as slave1 and slave2.
- Operations for a specific subset of classes denoted by myapp occur on a database referred to as other.
When using SQLAlchemy, we of course don't have the concept of "apps" as a delineating factor between different types. So we'll use the old fashioned approach and just use the type itself, that is, what hierarchy it inherits from, to determine which subsystem of the application it heralds from.
The SQLAlchemy Session makes all decisions about what Engine to use within the get_bind() method. This method is given pretty much the same information that db_for_read() and db_for_write() receive, where a Mapper is passed in, if available, and we can also check if the operation is a "write" just by checking if the Session is flushing.
Imports
We'll build up our example in the usual way, as a full script broken out. First the imports:
from sqlalchemy import Column, Integer, String, MetaData, create_engine from sqlalchemy.orm import scoped_session, sessionmaker, Session from sqlalchemy.ext.declarative import declarative_base import random import shutil
Engine Registry
Then, some Engine instances. Note that, unlike Django, SQLAlchemy is not a framework, and doesn't have a settings.py file or an existing registry of engines. Assuming we can decide on a decent place to put our own registry, we just stick them in a dict:
engines = { 'master': create_engine('sqlite:///master.db', logging_name='master'), 'other': create_engine('sqlite:///other.db', logging_name='other'), 'slave1': create_engine('sqlite:///slave1.db', logging_name='slave1'), 'slave2': create_engine('sqlite:///slave2.db', logging_name='slave2'), }
The example here uses SQLite databases, so that it's quick and easy to actually run the script. However, as I'm not familiar with replication functionality in SQLite, we'll have to "fake" the part where "master" replicates to "slave", just for the sake of example.
Routing
Next comes our implementation of get_bind(). We're building into a more open-ended space here, which can good or bad thing, depending on where you're coming from. Instead of building two "router" classes with two db routing methods each, we just need to make one method with some conditionals:
class RoutingSession(Session): def get_bind(self, mapper=None, clause=None): if mapper and issubclass(mapper.class_, OtherBase): return engines['other'] elif self._flushing: return engines['master'] else: return engines[ random.choice(['slave1','slave2']) ]
So above, we use a three-state conditional to make exactly those same choices the Django example does. When we want to detect the classes destined for the "other" engine, we look for a particular base class called OtherBase. We could just as easily do other kinds of checks here, such as checking for a particular attribute on the class.
We also look at a state variable called _flushing to determine operations destined for the master. The Session._flushing flag is set as soon as the flush procedure begins, and remains on until the method completes. SQLAlchemy uses this flag mainly to prevent re-entrant calls to autoflush when it's already inside of the flush process.
That's all we need in the way of Session, for the moment. To wire this Session up into the standard scoped_session(sessionmaker()) process, we can pass it into sessionmaker():
Session = scoped_session(sessionmaker(class_=RoutingSession))
Model Setup
Next, let's build up the "model". This is the part where we need to come up with an answer for Django's syncdb feature, which of course in SQLAlchemy is MetaData.create_all(). We now have two different schemas - one schema is shared between master, slave1, and slave2, the other is destined for other. We'll split out our table metadata among these backends using two different MetaData() objects. To achieve that transparently, I'll use a trick I don't think people are quite aware of yet, which is to use abstract declarative bases.
Starting with a useful declarative base that will give us an id and a data column, as well as a decent __repr__():
class Base(object): id = Column(Integer, primary_key=True) data = Column(String(50)) def __repr__(self): return "%s(id=%r, data=%r)" % ( self.__class__.__name__, self.id, self.data ) Base = declarative_base(cls=Base)
We then split out Base into two subtypes, which won't be mapped. To tell declarative not to map these non-mixin, direct descendants of Base, we use a fairly new flag called __abstract__:
class DefaultBase(Base): __abstract__ = True metadata = MetaData() class OtherBase(Base): __abstract__ = True metadata = MetaData()
and holy crap look at that a MetaData on each one! You can do that? Sure can - the classes we build on top of DefaultBase will put the Table objects into one MetaData, the classes we built on top of OtherBase will put them into another. We've basically just replaced the .metadata attribute declarative sets up with our own - there's no magic. The DefaultBase and OtherBase classes are also not mapped and are transparent to the mapping mechanism.
Let's build out three "models" (I really prefer to say "class", let's see if I can get to the end of this post without complaining more about it...):
class Model1(DefaultBase): __tablename__ = 'model1' class Model2(DefaultBase): __tablename__ = 'model2' class Model3(OtherBase): __tablename__ = 'model3'
Bang. For those unfamiliar with declarative, because these objects ultimately descend from Base above, they will have an id and a data column available, where id is a surrogate primary key.
Table Creation
We use MetaData.create_all() here. Anything that's DefaultBase should have tables created in master, slave1, slave2:
for eng in 'master', 'slave1', 'slave2': DefaultBase.metadata.create_all(engines[eng])
OtherBase then goes to other:
OtherBase.metadata.create_all(engines['other'])
Usage
We now have the tables built, we can show off things getting sent to master with a basic commit():
s = Session() s.add_all([ Model1(data='m1_a'), Model2(data='m2_a'), Model1(data='m1_b'), Model2(data='m2_b'), Model3(data='m3_a'), Model3(data='m3_b'), ]) s.commit()
If we have SQL echoing on, we'd see transactions starting on each of master and other, the requisite INSERT statements, then the two COMMIT calls. Note that the BEGIN for other doesn't occur until the unit of work actually comes across SQL destined for this engine:
INFO sqlalchemy.engine.base.Engine.master BEGIN (implicit) INFO sqlalchemy.engine.base.Engine.master INSERT INTO model1 (data) VALUES (?) INFO sqlalchemy.engine.base.Engine.master ('m1_a',) INFO sqlalchemy.engine.base.Engine.master INSERT INTO model1 (data) VALUES (?) INFO sqlalchemy.engine.base.Engine.master ('m1_b',) INFO sqlalchemy.engine.base.Engine.other BEGIN (implicit) INFO sqlalchemy.engine.base.Engine.other INSERT INTO model3 (data) VALUES (?) INFO sqlalchemy.engine.base.Engine.other ('m3_a',) INFO sqlalchemy.engine.base.Engine.other INSERT INTO model3 (data) VALUES (?) INFO sqlalchemy.engine.base.Engine.other ('m3_b',) INFO sqlalchemy.engine.base.Engine.master INSERT INTO model2 (data) VALUES (?) INFO sqlalchemy.engine.base.Engine.master ('m2_a',) INFO sqlalchemy.engine.base.Engine.master INSERT INTO model2 (data) VALUES (?) INFO sqlalchemy.engine.base.Engine.master ('m2_b',) INFO sqlalchemy.engine.base.Engine.other COMMIT INFO sqlalchemy.engine.base.Engine.master COMMIT
Now let's query. Normally, if we had a Postgresql or MySQL replication environment set up, the data we write to master would have been copied out to slave1 and slave2. So cover your eyes for a moment while we pretend:
##### PRETEND PRETEND PRETEND ###### # now let's pretend "master' is replicating to "slave1", "slave2" shutil.copy("master.db", "slave1.db") shutil.copy("master.db", "slave2.db") ##### END PRETEND END PRETEND END PRETEND ######
(note that SQLAlchemy as of 0.7 doesn't use a connection pool by default when it talks to SQLite - it just opens from the file each time. As long as nobody's talking to the database, we can swap out the file).
We can do some querying - SQL echoing is displayed inline here:
>>> print s.query(Model1).all() INFO sqlalchemy.engine.base.Engine.slave2 BEGIN (implicit) INFO sqlalchemy.engine.base.Engine.slave2 SELECT model1.id AS model1_id, model1.data AS model1_data FROM model1 [Model1(id=1, data='m1_a'), Model1(id=2, data='m1_b')] >>> print s.query(Model2).all() INFO sqlalchemy.engine.base.Engine.slave1 BEGIN (implicit) INFO sqlalchemy.engine.base.Engine.slave1 SELECT model2.id AS model2_id, model2.data AS model2_data FROM model2 [Model2(id=1, data='m2_a'), Model2(id=2, data='m2_b')] >>> print s.query(Model3).all() INFO sqlalchemy.engine.base.Engine.other BEGIN (implicit) INFO sqlalchemy.engine.base.Engine.other SELECT model3.id AS model3_id, model3.data AS model3_data FROM model3 [Model3(id=1, data='m3_a'), Model3(id=2, data='m3_b')]
Manual Access
Django also includes a "manual" selection mode via the using() modifier on the QuerySet object. Let's illustrate a modified version of our RoutingSession with a similar method added. We add one more check in get_bind(), to look for a local attribute called name - then we build a quick "generative" method that copies the state from one RoutingSession into another, so that the second session shares the same state:
class RoutingSession(Session): def get_bind(self, mapper=None, clause=None ): if self._name: return engines[self._name] elif mapper and issubclass(mapper.class_, OtherBase): return engines['other'] elif self._flushing: return engines['master'] else: return engines[ random.choice(['slave1','slave2']) ] _name = None def using_bind(self, name): s = RoutingSession() vars(s).update(vars(self)) s._name = name return s
So assuming we plugged in the above Session, we can explicitly query the master as:
m1 = Session().using_bind("master").query(Model1).first()
Admittedly, the using_bind() method above might be something I should add some helpers for, like a @generative decorator similar to that available for Query, so that the vars.update() approach is not needed.
I hope this example is useful, and sheds some light on how we do things in SQLAlchemy.
Download the example: sqlalchemy_multiple_dbs.py
Alembic Documentation Ready for Review
November 08, 2011 at 12:01 PM | SQLAlchemy, Code | View CommentsMany of you are aware that for quite some time I've had available a new migrations tool called Alembic. I wrote the majority of this code last year and basically haven't had much time to work on it, save for a little bit of integration at my day job.
Alembic has several areas that improve upon the current crop of migration tools, including:
- Full control over how migrations run, including multiple database support, transactional DDL, etc.
- A super-minimal style of writing migrations, not requiring full table definitions for simple operations.
- No monkeypatching or repurposing of core SQLAlchemy objects.
- Ability to generate migrations as SQL scripts, critical for working in large organizations on restricted-access production systems.
- A non-linear versioning model that allows, somewhat rudimentally, for branching and merging of multiple migration file streams.
- By popular request, designs for some degree of automation will be added, where "obivous" migrations of tables or columns being added or dropped as well as simple column attribute changes can be detected and rendered into migration scripts automatically.
I get asked about migration tools quite often, and I've always mentioned that I have such a tool available in an early form, it just lacks documentation, hoping that one out of so many eager users would be able to chip in a couple of days to help get it going. Turns out it's simply not possible to get someone to document your project for you; so here at the PloneConf sprints I've taken the time at the sprints to create initial documentation. Originally I was going to do some work on Dogpile but I've literally been asked about schema migrations, either in person or via reddit or twitter, about nine times in the past three days.
The documentation for Alembic will provide an overview of the tool, including theory of operation as well as philosophy, and some indicators of features not yet implemented. I would like feedback on the current approach. Alembic is not yet released though can be checked out from Bitbucket for testing. It's a pretty simple tool, 1163 lines of code at the moment, with plenty of room to support a lot more database features in a straightforward way.
I also have an open question about the notion of "automatic" migrations - how does one discern detecting whether or not a table or column has had a name change, or if a new table/column was added and an old one dropped? Just a curiosity as I attempt to avoid reading South's source code.
Mike Bayer is the creator of