Monday, February 18, 2013

Python Type Gripe

I've been digging into Python lately. On Friday, I ran into an unintuitive behavior where two sets containing identical objects (of a custom class) were comparing unequal. I bet you can guess the reason: I hadn't defined __hash__ and __eq__ for my class. Take a look at this documentation for __repr__:
For many types, this function makes an attempt to return a string that would yield an object with the same value when passed to eval(), otherwise the representation is a string enclosed in angle brackets that contains the name of the type of the object together with additional information often including the name and address of the object. A class can control what this function returns for its instances by defining a __repr__() method.
"For many?" "makes an attempt?!" "often including?!!" To my way of thinking, this is a little crazy. There should be a single root class in the inheritance hierarchy, and that root should define __repr__, __hash__, and __eq__ methods that operate on/check for equality of each attribute in the class (by value, not by memory address!) Then they would behave consistently for all classes. Then two sets, each containing three objects of the same class, each containing the same attributes with the same values, would compare equal when you used == on them, following the principle of least surprise.

Of course, I can't rewrite Python to make this happen. I'm tempted just to make my own root class that implements these behaviors via metaprogramming, and make every class I ever define a subclass of it. Is there a better way?

PS: A colleague pointed out that this behavior reflects a deliberate decision in Python design philosophy: "Python […] favors informal protocols over inheritance trees […] You can think of __eq__ and __hash__ as kind of being parts of an informal 'collectable' protocol." That sort of makes sense. Of course, the thing about informal protocols is that there's nothing to enforce consistency.

Friday, February 1, 2013

Why PostgreSQL Doesn't Allow Unsafe Implicit Type Coercions

I got bitten by a MySQL bug today, and it reminded me why we don't allow¹ unsafe type coercions in PostgreSQL. It's easy to talk about them in the abstract, but harder to explain why they're a problem in practice. Here's a really clear example.

I do most of my consulting through pgExperts these days, but I still have a legacy client on MySQL. Today I was writing a new feature for them. Using test-driven development, I was debugging a particular invocation of a five-way join. I found it was returning many rows from a certain right-hand table, when it should have been returning none.

I took a closer look at my SQL and found this code²:

    …
    WHERE order.status_id NOT IN ('completed', 'cancelled')
    …

But that's not right! status_id isn't a varchar. It's an int FK to a lookup table:

    CREATE TABLE statuses (
        id int(10) unsigned NOT NULL AUTO_INCREMENT,
        name varchar(16) NOT NULL,
        PRIMARY KEY (id),
        UNIQUE KEY name (name)
    ) ENGINE=InnoDB;

MySQL was silently coercing my integer order.status_id to a varchar, comparing it to 'completed' and 'cancelled' (which of course it could never match), and returning a bunch of right-hand table rows as a result. The filter was broken.

PostgreSQL would never haved allowed this. It would have complained about the type mismatch between status_is and 'completed' or 'cancelled'. Rather than a confusing resultset, I'd get a clear error message at SQL compile time:

    ERROR: invalid input syntax for integer: "cancelled"
    LINE 8: select * from frobs where order.status_id NOT IN ('cancelled', 'c...
                                                                                               ^

Fortunately, I caught and fixed this quickly (hurrah for test-driven development!) The fix was simple:

    …
    JOIN statuses stat on stat.id = order.status_id
    WHERE stat.name NOT IN ('completed', 'cancelled')
    …

But under different circumstances this could have seriously burned me.

And that's why we disallow certain implicit type coercions. This post is not intended as a rant against MySQL. Rather, it's an illustration of how a seemingly abstract issue can seriously hose your code. Caveat hacker!


1. Since PostgreSQL 8.3 (2008), that is. Before that, we did allow unsafe coercions.

2. I've changed the code and identifiers to protect client confidentiality.