The Absolutely Simplest Consistent Hashing Example

July 07, 2012 at 12:01 PM | Code

Lately I've been studying Redis a lot. When using key/value databases like Redis, as well as caches like Memcached, if you want to scale keys across multiple nodes, you need a consistent hashing algorithm. Consistent hashing is what we use when we want to distribute a set of keys along a span of key/value servers in a...well consistent fashion.

If you Google around to learn what consistent hashing means, the article that most directly tells you "the answer" without a lot of handwringing is Consistent Hashing by Tom White. Not only does it explain the concept very clearly, it even has a plain and simple code example in Java.

The recipe in Tom's post is dependent on the capabilities of Java's TreeMap which we don't have in Python, but after some contemplation it became apparent that the functionality of circle.tailMap(hash) is something we already have using bisect, that is, we have a sorted array of integers, and a new number. Where in the array does the new number go? bisect.bisect() will give you that, with the same efficiency as TreeMap.

As a sanity check, I searched a bit more for Python implementations. I found a recipe by Amir Salihefendic, which seems to be based on the Java recipe and is pretty nice, but in the post he's searching the circle for hash values using a linear search, ouch! Turns out Amir is in fact using bisect in his Python Cheese Shop package hash_ring, but by then it was too late, I had already written my own recipe as well as tests (which hash_ring doesn't appear to have, at least in the downloaded distribution). There's also Continuum, taking a slightly more heavy-handed approach (three separate classes and an expensive IndexError being caught to detect keys beyond the circle). Both systems, Continuum more so, seem to encourage using hostnames directly as keys - as noted by Jeremy Zawodny, with a persistent system like Redis this is a bad idea as it means you can't move a particular key set to a new host.

So spending a bit of NIH capital, here's my recipe, which provides a dictionary interface so that you can store hostnames or even actual client instances, keyed to symbolic names:

import bisect
import md5

class ConsistentHashRing(object):
    """Implement a consistent hashing ring."""

    def __init__(self, replicas=100):
        """Create a new ConsistentHashRing.

        :param replicas: number of replicas.

        """
        self.replicas = replicas
        self._keys = []
        self._nodes = {}

    def _hash(self, key):
        """Given a string key, return a hash value."""

        return long(md5.md5(key).hexdigest(), 16)

    def _repl_iterator(self, nodename):
        """Given a node name, return an iterable of replica hashes."""

        return (self._hash("%s:%s" % (nodename, i))
                for i in xrange(self.replicas))

    def __setitem__(self, nodename, node):
        """Add a node, given its name.

        The given nodename is hashed
        among the number of replicas.

        """
        for hash_ in self._repl_iterator(nodename):
            if hash_ in self._nodes:
                raise ValueError("Node name %r is "
                            "already present" % nodename)
            self._nodes[hash_] = node
            bisect.insort(self._keys, hash_)

    def __delitem__(self, nodename):
        """Remove a node, given its name."""

        for hash_ in self._repl_iterator(nodename):
            # will raise KeyError for nonexistent node name
            del self._nodes[hash_]
            index = bisect.bisect_left(self._keys, hash_)
            del self._keys[index]

    def __getitem__(self, key):
        """Return a node, given a key.

        The node replica with a hash value nearest
        but not less than that of the given
        name is returned.   If the hash of the
        given name is greater than the greatest
        hash, returns the lowest hashed node.

        """
        hash_ = self._hash(key)
        start = bisect.bisect(self._keys, hash_)
        if start == len(self._keys):
            start = 0
        return self._nodes[self._keys[start]]

The map is used as a dictionary of node names to whatever you want, such as here we use Redis clients:

import redis
cr = = ConsistentHashRing(100)

cr["node1"] = redis.StrictRedis(host="host1")
cr["node2"] = redis.StrictRedis(host="host2")

client = cr["some key"]
data = client.get("some key")

I wanted to validate that the ring is in fact producing standard deviations like those mentioned in the Java article, so this is tested like the following:

import unittest
import collections
import random
import math

class ConsistentHashRingTest(unittest.TestCase):
    def test_get_distribution(self):
        ring = ConsistentHashRing(100)

        numnodes = 10
        numhits = 1000
        numvalues = 10000

        for i in range(1, 1 + numnodes):
            ring["node%d" % i] = "node_value%d" % i

        distributions = collections.defaultdict(int)
        for i in xrange(numhits):
            key = str(random.randint(1, numvalues))
            node = ring[key]
            distributions[node] += 1

        # count of hits matches what is observed
        self.assertEquals(sum(distributions.values()), numhits)

        # I've observed standard deviation for 10 nodes + 100
        # replicas to be between 10 and 15.   Play around with
        # the number of nodes / replicas to see how different
        # tunings work out.
        standard_dev = self._pop_std_dev(distributions.values())
        self.assertLessEqual(standard_dev, 20)

        # if the stddev is good, it's safe to assume
        # all nodes were used
        self.assertEquals(len(distributions), numnodes)

        # just to test getting keys, see that we got the values
        # back and not keys or indexes or whatever.
        self.assertEquals(
                set(distributions.keys()),
                set("node_value%d" % i for i in range(1, 1 + numnodes))
            )

    def _pop_std_dev(self, population):
        mean = sum(population) / len(population)
        return math.sqrt(
                sum(pow(n - mean, 2) for n in population)
                / len(population)
            )

Server Side Templates and API Centric Development

June 18, 2012 at 12:01 PM | Mako, Code
Discuss this on Hacker News

We're here to talk about the rise of the API-focused application, and how it interacts with templates for rendering HTML for web browsers. The simple point I hope to make is: you don't necessarily need to use all client side templates in order to write an effective API-centric web application.

I'll do this by illustrating a simple web application, with a single API-oriented method (that is, returns JSON-oriented data), where the way it's rendered and the style of template in use is entirely a matter of declarative configuration. Rendering of full pages as well as Ajax delivered "fragments" are covered using both server- and client-side rendering approaches.

API Centric Development

Most dynamic web applications we write today have the requirement that they provide APIs - that is, methods which return pure data for the consumption by a wide variety of clients. The trend here is towards organizing web applications from the start to act like APIs. In a nutshell, it means we are moving away from mode of data models injected into templates:

def get_balance(request):
    balance = BankBalance(
        amount = Amount(10000, currency="euro"),
        as_of_date = datetime.datetime(2012, 6, 14, 12, 0, 0)
    )
    return render_template("balance.html", balance=balance)

and instead towards returning JSON-compatible data structures, with the "view" being decided somewhere else:

@view_config('balance', renderer='json')
def balance(request):
    return {
        'amount':10000,
        'currency':'euro',
        'as_of_date':'2012-06-14 14:12:00'
    }

This is a fine trend, allowing us to make a clearer line between server side concepts and rendering concepts, and giving us an API-centric view of our data from day one.

Templates

There's a misunderstanding circulating about API-centric development which says that we have to use client side templates:

API-centric development. If you take a client-side rendering approach, odds are the server side of your web application is going to look more like an API than it would if you were doing entirely server-side rendering. And if/when you have plans to release an API, you'd probably already be 90% of the way there. (http://openmymind.net/2012/5/30/Client-Side-vs-Server-Side-Rendering/#comment-544974348)

and:

Lastly, another issue i can see, assuming you develop using MVC. You need to have the view, model and controller very tightly coupled to make his way work. The controller needs to know how the view works to create html that can slot right in. It's easier if the controller only has to pass data, not layout information. (http://www.reddit.com/r/programming/comments/ufyf3/clientside_vs_serverside_rendering/c4v5vjb)

This second quote is specific to the approach of delivering rendered HTML in an ajax response, to be directly rendered into a DOM element, which we will also demonstrate here. How the "model" is more tightly coupled to anything when you have a server side vs client side template, I have absolutely no idea.

Why might we want to stick with server side templates? As a Python developer, in my view the main reason is that they are still a lot easier to develop with, assuming we aren't developing our server in Javascript as well. Consider if our bank account balance needed locale-specific number, currency and date formatting, and also needed to convert the timestamp from UTC into a preferred timezone. A server side approach allows us to easily inject more functionality into our template as part of its standard environment for all render calls, such as something like this:

def render_template(request, template, **template_ns):
    number_formatter = number_formatter(request.user.preferred_locale)
    template_ns['currency_format'] = currency_formatter(number_formatter)
    template_ns['date_format'] = date_formatter(
                                request.user.preferred_timezone,
                                request.user.preferred_date_format)
    return lookup.get_template(template).render(**template_ns)

The template can, if it needs to, call upon these methods like this:

<ul>
    <li>Balance: ${data['amount'] | currency_format(data['currency'])}</li>
    <li>As of: ${data['as_of_date'] | date_format}</li>
</ul>

With a client side template, we have to implement all of currency_formatter, number_formatter, date_formatter in Javascript. These methods may need to respond to business-specific inputs, such as specific user preferences or other rules, that also need to be pushed up to the client. All the additional Javascript logic we're pushing out to the client brings forth the need for it to be unit tested, which so far means we need to add significant complexity to the testing environment by running a Javascript engine like node.js or something else in order to test all this functionality. Remember, we're not already using node.js to do the whole thing. If we were, then yes everything is different, but I wasn't planning on abandoning Python for web development just yet.

Some folks might argue that elements like date conversion and number formatting should still remain on the server, but just be applied by the API to the data being returned directly. To me, this is specifically the worst thing you can do - it basically means that the client side approach is forcing you to move presentation-level concepts directly into your API's data format. Almost immediately, you'll find yourself having to inject HTML entities for currency symbols and other browser-specific markup into this data, at the very least complicating your API with presentational concerns and in the worst case pretty much ruining the purity of your API data. While the examples here may be a little contrived, you can be sure that more intricate cases come up in practice that present an ongoing stream of hard decisions between complicating/polluting server-generated API data with presentation concepts versus building a much heavier client than initially seemed necessary.

Also, what about performance? Don't client side/server side templates perform/scale/respond worse/better? My position here is "if we're just starting out, then who knows, who cares". If I've built an application and I've observed that its responsiveness or scalability would benefit from some areas switching to client side rendering, then that's an optimization that can be made later. We've seen big sites like LinkedIn switch from server to client side rendering, and Twitter switch from client to server side rendering, both in the name of "performance"! Who knows! Overall, I don't think the difference between pushing out json strings versus HTML fragments is something that warrants concern up front, until the application is more fully formed and specific issues can addressed as needed.

The Alternative

Implementing a larger client-side application than we might have originally preferred is all doable of course, but given the additional steps of building a bootstrap system for a pure client-side approach, reimplementing lots of Python functionality, in some cases significant tasks such as timezone conversion, into Javascript, and figuring out how to unit test it all, is a lot of trouble for something that is pretty much effortless in a server side system - must we go all client-side in order to be API centric?

Of course not!

The example I've produced illustrates a single, fake API method that produces a grid of random integers:

@view_config(route_name='api_ajax', renderer='json')
def api(request):
    return [
        ["%0.2d" % random.randint(1, 99)  for col in xrange(10)]
        for row in xrange(10)
    ]

and delivers it in four ways - two use server side rendering, two use client side rendering. Only one client template and one server side template is used, and there is only one API call, which internally knows nothing whatsoever about how it is displayed. Basically, the design of our server component is un-impacted by what style of rendering we use, save for different routing declarations which we'll see later.

For client side rendering of this data, we'll use a handlebars.js template:

<p>
    API data -
    {{#if is_page}}
        Displayed via server-initiated, client-rendered page
    {{/if}}
    {{#unless is_page}}
        Displayed via client-initiated, client-rendered page
    {{/unless}}
</p>

<table>
    {{#each data}}
        <tr>
            {{#each this}}
                <td>{{this}}</td>
            {{/each}}
        </tr>
    {{/each}}
</table>

and for server side rendering, we'll use a Mako template:

<%inherit file="layout.mako"/>

${display_api(True)}

<%def name="display_api(inline=False)">
    <p>
        API data -
        % if inline:
            Displayed inline within a server-rendered page
        % else:
            Displayed via server-rendered ajax call
        % endif
    </p>
    <table>
        % for row in data:
            <tr>
                % for col in row:
                    <td>${col}</td>
                % endfor
            </tr>
        % endfor
    </table>
</%def>

The Mako template is using a <%def> to provide indirection between the rendering of the full page, and the rendering of the API data. This is not a requirement, but is here because we'll be illustrating also how to dual purpose a single Mako template such that part of it can be used for a traditional full page render as well as for an ajax-delivered HTML fragment, with no duplication. It's essentially a Pyramid port of the same technique I illustrated with Pylons four years ago in my post Ajax the Mako Way, which appears to be somewhat forgotten. Among other things, Pyramid's Mako renderer does not appear integrate the capability to call upon page defs directly, even though I had successfully lobbied to get the critical render_def() into its predecessor Pylons. Here, I've implemented my own Mako renderer for Pyramid.

Key here is that we are separating the concept of how the API interface is constructed, versus what the server actually produces. Above, note we're using the Pyramid "json" renderer for our API data. Note the term "renderer". Interpreting our API method as a JSON API, as a call to a specific client-side template plus JSON API, or as a server side render or ajax call is just a matter of declaration. The way we organize our application in an API-centric fashion has nothing to do with where the rendering takes place. To illustrate four different ways of interpreting the same API method, we just need to add four different @view_config directives:

@view_config(route_name='server_navigate', renderer='home.mako')
@view_config(route_name='server_ajax', renderer='home|display_api.mako')
@view_config(route_name='client_navigate', renderer='display_api.handlebars')
@view_config(route_name='api_ajax', renderer='json')
def api(request):
    return [
        ["%0.2d" % random.randint(1, 99)  for col in xrange(10)]
        for row in xrange(10)
    ]

The rendering methods here are as follows:

  • Method One, Server Via Server - The api() view method returns the data, which is received by the home.mako template, which renders the full page, and passes the data to the display_api() def within the same phase for all-at-once server side rendering.
  • Method Two, Server Via Client - A hyperlink on the page initiates an ajax call to the server's server_ajax route, which invokes the api() view method, and returns the data directly to the display_api def present in the home.mako template. For this case, I had to create my own Pyramid Mako renderer that receives a custom syntax, where the page name and def name are separated by a pipe character.
  • Method Three, Client Via Server - Intrinsic to any client side rendered approach is that the server needs to first deliver some kind of HTML layout, as nothing more than a launching point for all the requisite javascript needed to start rendering the page for real. This method illustrates that, by delivering the handlebars_base.mako template which serves as the bootstrap point for any server-initiated page call that renders with a client side template. In this mode, it also embeds the data returned by api() within the <script> tags at the top of the page, and then invokes the Javascript application to render the display_api.handlebars template, providing it with the embedded data. Another approach here might be to deliver the template in one call, and to have the client invoke the api() method as a separate ajax call, though this takes two requests instead of one and also implies adding another server-side view method.
  • Method Four, Client Via Client - A hyperlink on the page illustrates how a client-rendered application can navigate to a certain view, calling upon the server only for raw data (and possibly the client side template itself, until its cached in a client-side collection). The link includes additional attributes which allow the javascript application to call upon the display_api.handlebars template directly, and renders it along with the data returned by calling the api() view method with the json renderer.

Credit goes to Chris Rossi for coming up with the original client-side rendering techniques that I've adapted here.

A screen shot of what we're looking at is as follows:

Template Demo Screenshot

The demonstration here is hopefully useful not just to illustrate the server techniques I'm talking about, but also as a way to play around with client side rendering as well, including mixing and matching both server and client side rendering together. The hybrid approach is where I predict most applications will be headed.

You can pull out the demo using git at https://bitbucket.org/zzzeek/client_template_demo. Enjoy !


Using Beaker for Caching? Why You'll Want to Switch to dogpile.cache

April 19, 2012 at 12:01 PM | Code

Continuing on where I left off regarding Beaker in October (see Thoughts on Beaker), my new replacement for Beaker caching, dogpile.cache, has had a bunch of early releases. While I'm still considering it "alpha" until I know a few people have taken it around the block, it should be pretty much set for early testing and hopefully can be tagged as production quality in the near future.

The core of Beaker's caching mechanism is based on code I first wrote in 2005. It was adapted from what was basically my first Python program ever, a web template engine called Myghty, which in turn was based on a Perl system called HTML::Mason. The caching scenarios Beaker was designed for were primarily that of storing data in files, such as DBM files. A key assumption made at that time was that the backends would all provide some system of returning a flag whether or not a key was present, which would precede the actual fetch of the value from the cache. Another assumption made was that the actual lock applied to these backends to deal with the dogpile situation would be at its most "distributed" scope a file-based lock, using flock().

When memcached support was added to Beaker, these assumptions proved to be architectural shortcomings. There is no "check for a key" function in memcached; there's only get(). Beaker's dogpile lock calls "check for a key" twice. As a result, Beaker will in general pull a value out of memcached three times, each time resulting in an "unpickle" of a pickled object. The upshot of this is that Beaker pulls over the network and unpickles your object three times times on every cache hit. Users of Beaker are also well familiar with the awkward lock files Beaker insists on generating, even though there are more appropriate ways to lock for distributed caches.

So for no other reason than these, dogpile.cache's entirely new and extremely simplified architecture is an improvement of vast proportions. The test program below illustrates the improvement in unpickling behavior, as well as dogpile.cache's simplified API:

class Widget(object):
    """Sample object to be cached.

    Counts pickles and unpickles.

    """
    pickles = 0
    unpickles = 0

    def __init__(self, id):
        self.id = id

    def __getstate__(self):
        Widget.pickles +=1
        return self.__dict__

    def __setstate__(self, state):
        Widget.unpickles +=1
        self.__dict__.update(state)

def test_beaker():
    from beaker import cache

    cache_manager = cache.CacheManager(cache_regions={
    'default' :{
            'type':'memcached',
            'url':'127.0.0.1:11211',
            'expiretime':1,
            'lock_dir':'.',
            'key_length':250
        }
    })

    @cache_manager.region("default", "some_key")
    def get_widget_beaker(id):
        return Widget(id)

    _run_test(get_widget_beaker)

def test_dogpile():
    from dogpile.cache import make_region
    from dogpile.cache.util import sha1_mangle_key

    region = make_region(key_mangler=sha1_mangle_key).configure(
        'dogpile.cache.memcached',
        expiration_time = 1,
        arguments = {
            'url':["127.0.0.1:11211"],
        },
    )

    @region.cache_on_arguments()
    def get_widget_dogpile(id):
        return Widget(id)

    _run_test(get_widget_dogpile)

def _run_test(get_widget):
    """Store an object, retrieve from the cache.

    Wait two seconds, then exercise a regeneration.

    """
    import time

    Widget.pickles = Widget.unpickles = 0

    # create and cache a widget.
    # no unpickle necessary.
    w1 = get_widget(2)

    # get it again.  one pull from cache
    # equals one unpickle needed.
    w1 = get_widget(2)

    time.sleep(2)

    # get from cache, will pull out the
    # object but also the fact that it's
    # expired (costs one unpickle).
    # newly generated object
    # cached and returned.
    w1 = get_widget(2)

    print "Total pickles:", Widget.pickles
    print "Total unpickles:", Widget.unpickles

print "beaker"
test_beaker()

print "dogpile"
test_dogpile()

Running this with a clean memcached you get:

beaker
Total pickles: 2
Total unpickles: 6
dogpile
Total pickles: 2
Total unpickles: 2

Run it a second time, so that the Widget is already in the cache. Now you get ten unpickles with Beaker compared to dogpile.cache's three:

beaker
Total pickles: 2
Total unpickles: 10
dogpile
Total pickles: 2
Total unpickles: 3

The advantages of dogpile.cache go way beyond that:

  • dogpile.cache includes distinct memcached backends for pylibmc, memcache and bmemcached. These are all explicitly available via different backend names, in contrast to Beaker's approach of deciding for you which memcached backend it wants to use.
  • A dedicated API-space for backend-specific arguments, such as all the special arguments pylibmc offers.
  • A Redis backend is provided.
  • The system of "dogpile locking" is completely modular, and in the case of memcached and Redis, a "distributed lock" option is provided which will use the "set key if not exists" feature of those backends to provide the dogpile lock. A plain threaded mutex can be specified also.
  • Cache regions and function decorators are open ended. You can plug in your own system of generating cache keys from decorated functions, as well as what kind of "key mangling" you'd like to apply to keys going into the cache (such as encoding, hashing, etc.)
  • No lockfiles whatsoever unless you use the provided DBM backend; and there, you tell it exactly where to put the lockfile, or tell it to use a regular mutex instead.
  • New backends are ridiculously simple to write, and can be popped in using regular setuptools entry points or in-application using the register_backend() function.
  • Vastly simplified scope - there's no dilution of the task at hand with session, cookie, or encryption features.
  • Python 3 compatible in-place with no 2to3 step needed.

So I'm hoping we can all soon get modernized onto dogpile.cache.

dogpile.cache documentation.


Pycon 2012 : Hand Coded Applications with SQLAlchemy

March 12, 2012 at 12:01 PM | Talks, SQLAlchemy, Code

Here's the slides from my Pycon 2012 talk, "Hand Coded Applications with SQLAlchemy". I had a great time with this talk and thanks all for coming !

Update: Here's the video!


Patterns Implemented by SQLAlchemy

February 07, 2012 at 12:01 PM | SQLAlchemy, Code

When I first created SQLAlchemy, I knew I wanted to create something significant. It was by no means the first ORM or database abstraction layer I'd written; by 2005, I'd probably written about a dozen abstraction layers in several languages, including in Java, Perl, C and C++ (really bad C and even worse C++, one that talked to ODBC and another that communicated with Microsoft's ancient DB-LIB directly). All of these abstraction layers were in the range of awful to mediocre, and certainly none were anywhere near release-quality; even by late-90's to early-2000's standards. They were all created for closed-source applications written on the job, but each one did its job very well.

It was the repetitive creation of the same patterns over and over again that made apparent the kinds of things a real toolkit should have, as well as increased the urge to actually go through with it, so that I wouldn't have to invent new database interaction layers for every new project, or worse, be compelled by management to use whatever mediocre product they had read about the week before (keeping in mind I was made to use such disasters as EJB 1.0). But at the same time it was apparent to me that I was going to need to do some research up front as well. The primary book I used for this research was Patterns of Enterprise Archictecture by Martin Fowler. When reading this book, about half the patterns were ones that I'd already used implicitly, and the other half were ones that I was previously not entirely aware of.

Sometimes I read comments from new users expressing confusion or frustration with SQLAlchemy's concepts. Maybe some of these users are not only new to SQLAlchemy but are new to database abstraction layers in general, and some maybe even to relational databases themselves. What I'd like to lay out here is just how many of POEAA's patterns SQLAlchemy is built upon. If you're new to SQLAlchemy, my hope is that this list might help to de-mystify where these patterns come from.

These links are from Catalog of Patterns of Enterprise Architecture.

  • Data Mapper - The key to this pattern is that object-relational mapping is applied to a user-defined class in a transparent way, keeping the details of persistence separate from the public interface of the class. SQLAlchemy's classical mapping system, which is the usage of the mapper() function to link a class with table metadata, implemented this pattern as fully as possible. In modern SQLAlchemy, we use the Declarative pattern which combines table metadata with the class' declaration as a shortcut to using mapper(), but the persistence API remains separate.
  • Unit of Work - This pattern is where the system transparently keeps track of changes to objects and periodically flushes all those pending changes out to the database. SQLAlchemy's Session implements this pattern fully in a manner similar to that of Hibernate.
  • Identity Map - This is an essential pattern that establishes unique identities for each object within a particular session, based on database identity. No ORM should be without this feature, as working with object structures and applications of the most moderate complexity is vastly simplified and made more efficient with this pattern in place.
  • Metadata Mapping - this chapter in the book is where the name MetaData comes from. The exact correspondence to Fowler's pattern would be the combination of mapper() and Table.
  • Query Object - Both the ORM Query and the Core select() construct are built on this pattern.
  • Repository - An interface that serves as the gateway to the database, in terms of object-relational mappings. This is the SQLAlchemy Session.
  • Lazy Load - Load a related collection or object as you need it. SQLAlchemy, like Hibernate, has a lot of options in how attributes can load things.
  • Identity Field - Represent the primary key of a table's row within the object that represents it.
  • Foreign Key Mapping - Database foreign keys are represented using relationships in the object model.
  • Association Table Mapping - A class can be mapped that represents information about how two objects are related to each other. Use the Association Object for this pattern.
  • Embedded Value - a value inline on an object represents multiple columns. SQLAlchemy provides the Composite pattern here.
  • Serialized LOB - Sometimes you just want to stuff all the objects into a BLOB. Use the PickleType or roll a JSON type.
  • Inheritance Mappers - Represent class hierarchies within database tables. See Inheritance Mapping.
  • Optimistic Offline Lock - Set up a version id on your mapping to enable this feature in SQLAlchemy.

Thanks for reading!