Wednesday, May 23, 2012

The Girl With The ANSI Tattoo

I enjoyed the David Fincher remake of The Girl With The Dragon Tattoo more than I thought I would. Rather than a shallow and cynical Hollywood cash-in, it's actually a tense, atmospheric, only slightly voyeuristic crime thriller. My favourite part, though, was when Lisbeth Salander begins to solve a 40 year old murder cold case using SQL.

[girl_tattoo_overshoulder.jpg]

We see her tapping at her laptop as she hacks effortlessly into the Swedish police database, interspersed with green-tinted tracking shots of scrolling text as she types in keywords like 'unsolved' and 'decapitation', though never quite the whole query:

[girl_tattoo1.jpg] [girl_tattoo2.jpg]
[girl_tattoo3-mari-magda.jpg]

Naturally I couldn't help stitching a few screenshots together in Photoshop, and this is what I got:

Immediately moviegoers will notice that this can't be Oracle SQL - obviously the AS keyword is not valid for table aliases. In fact as we pull back for a thrilling query results listing we see the mysql prompt and giveaway use [dbname] connect syntax and over-elaborate box drawing.

[girl_tattoo_results1.jpg]

Notice we can just make out the 'FT' of an ANSI left join to the Keyword table.

Finally we get a full-screen shot of the results listing for Västra Götaland:

[girl_tattoo_results2.jpg]

Here's what we were able to reconstruct in the Oracle WTF Forensics department:

SELECT DISTINCT v.fname, v.lname, i.year, i.location, i.report_file
FROM   Incident AS i
       LEFT JOIN Victim AS v on v.incident_id = i.id
       LEFT JOIN Keyword AS k ON k.incident_id = i.id
WHERE  i.year BETWEEN 1947 AND 1966
AND    i.type = 'HOMICIDE'
AND    v.sex = 'F'
AND    i.status = 'UNSOLVED'
AND    (  k.keyword IN
          ('rape', 'decapitation', 'dismemberment', 'fire', 'altar', 'priest', 'prostitute')
        OR v.fname IN ('Mari', 'Magda')
        OR SUBSTR(v.fname, 1, 1) = 'R' AND SUBSTR(v.lname, 1, 1) = 'L' );

+--------+---------+------+-----------+----------------------------------+
| fname  | lname   | year | location  | report_file                      |
+--------+---------+------+-----------+----------------------------------+
| Anna   | Wedin   | 1956 | Mark      | FULL POLICE REPORT NOT DIGITIZED |
| Linda  | Janson  | 1955 | Mariestad | FULL POLICE REPORT NOT DIGITIZED |
| Simone | Grau    | 1958 | Goteborg  | FULL POLICE REPORT NOT DIGITIZED |
| Lea    | Persson | 1962 | Uddevalla | FULL POLICE REPORT NOT DIGITIZED |
| Kajsa  | Severin | 1962 | Dals-Ed   | FULL POLICE REPORT NOT DIGITIZED |
+--------+---------+------+-----------+----------------------------------+

Shocked moviegoers will have been left wondering why a genius-level hacker would outer-join to the Victims and Keywords tables only to use literal-text filter predicates that defeat the outer joins, and whether MySQL has a LIKE operator.

75 comments:

Anonymous said...

Also it seems that the table design has separate fields for the different date parts instead of having a single date datatype column.

ts ts ts

William Robertson said...

I expect they considered "where i.incident_date >= date '1947-01-01' and i.incident_date < date '1967-01-01'", but decided that despite representing a better data model it provided less cinematic impact.

Adrian said...

I was shocked to see such appallingly explicit portrayal of an entity-attribute-value model. Not to mention the sick insinuation of the presence of an in-lined BLOB.

Young impressionable minds witness this kind of thing and studies have shown they go out into the world and commit copy-cat acts.

Just this week I came across an EAV model that mirrored this scene in a chilling manner.

The BBFC must come down on this sort of thing and not merely assume they have discharged their social responsibility by slapping an 18 certificate on it.

Klippx said...

I find the DISTINCT clause funny, why would multiple rows be returned? Sure, there can be many incidents for the same victim but the incident type is limited to "HOMICIDE". Could the same victim be killed multiple times?

I assume l.status is a typo on your side (should be i.status)

Jan S. said...

I hope for a next film remake they hire a senior database expert to help them designing a better data model and also rewrite SQL for more accurate semantic and also better performance.

William Robertson said...

Klippx - I think she needs the DISTINCT because of the keyword list.

Thanks for spotting the typo, I'll fix it...

William Robertson said...

Adrian - I did consider starting with a parental advisory that this blog post contained strong though ANSI-compliant language, but in the end I felt it detracted from the artistic integrity of the piece.

PTW said...

LEFT JOIN...becuse not all incidents have victims or keywords

William Robertson said...

Except ones where v.sex = 'F' and k.keyword in ('fire', 'altar', 'priest', 'prostitute').

Adrian said...

Professor Plum in the Conservatory with the Lead Pipe.

A ruthless and cunning criminal who left no trace of tell-tale keywords behind in his crime scene.

William Robertson said...

WHERE  i.year = 1926
AND    i.type = 'HOMICIDE'
AND    v.sex = 'M'
AND    i.status = 'UNSOLVED'
AND    k.keyword IN ('Candlestick', 'Knife', 'Lead Pipe', 'Revolver', 'Rope', 'Spanner')
AND    SUBSTR(v.lname, 1, 1) = 'B');

TBolt said...

Hey, it worked in the movie. :) The casual viewer is convinced she is a hacker queen.

Awesome post!

Noons said...

I'll bet this will show up as pasted code in a java app at some stage in the near future!

Adrian said...

SQL? Pasted into a Java app? Are you crazy? Something similar, yet totally unintelligible may be dynamically generated by Hibernate in the data persistence layer, surely?

Aalaap said...

Posts like these reinstate my lost faith in humanity. I love you guys.

Unknown said...

"LEFT JOIN...becuse not all incidents have victims or keywords "

But one of the predicates is that the keyword falls within a set, so the keywords are not optional.

And to a prior comment, the distinct is necessary due to the presumable many to one relationship between keyword and incident.

Sheeri K. Cabral said...

MySQL does, in fact, have a LIKE operator (though I don't see where it's used in the snippet you posted).

Anonymous said...

Great post *and* comments, but definitely: Noons FTW!

Kieran said...

They should have just stuck with the usual rotating 3D shapes, furious key-bashing and flashing "HACK IN PROGRESS" text. Far less embarrassing.

Harrison said...

I wrote the prompt code for the MySQL command line client. I did not get royalties or even put in the movie credits =(

raides said...

Oracle owns mysql now, maybe the line shouldn't have been taken so literal and it was just a mistake by the person saying it.

Unknown said...

What's the EAV portion that Adrian is alluding to? I feel I must be missing something.

Rottenchester said...

Hey, you might understand SQL but how about some basic HTML: Don't hot link images from some poor guy's site. Download them and upload them to blogger so you won't make him exceed his quota at his website provider.

sep332 said...

Rottenchester: It's his own domain that's hosting the images :)

William Robertson said...

Yup, great idea that turned out to be...

Anonymous said...

Scumbag author:

Knows SQL





can't link images to save his own life

William Robertson said...

Thanks RC :) I wasn't expecting someone to link to it from news.ycombinator.com and suddenly get 65,000 hits. Just moved them to Flickr, normal service now resumed...

William Robertson said...

Sheeri -
> "MySQL does, in fact, have a LIKE operator (though I don't see where it's used in the snippet you posted)."

That was my point. They expect a movie audience to believe that a world-class hacker would type "substr(v.fname ,1,1) = 'L'" when she could have just used "v.fname like 'L%'", and possibly got an index range scan?

William Robertson said...

Luke - I often wonder what Adrian is on about. No idea either.

Unknown said...

I've used "as" with Oracle...

William Robertson said...

But only in SELECT lists and WITH clauses ;)

The fact that the "table name AS" syntax is what jumped out at me when I watched this scene is what I really find funny.

Pedro Alves said...

Real man would have used MDX ;)

rpbouman said...

@pedro: you're just saying that because it rhymes to "cam wee sex"

said...

Honestly, I'm just impressed that they used real SQL, a real database, a real terminal emulator, and real Google Maps. So many movies depict hacking/computers in general as looking like some kind of video game...

Klippx said...

Relevant: http://www.youtube.com/watch?v=dFUlAQZB9Ng

"It's a UNIX system!"

Massimo Brignoli said...

Congratulations to all for the great comments :)))

Regarding the 'like':

do you know if there is an index on both fname and lname?

If not, a full table will be executed as using the substr.
But the substr is faster.

So the sexy girl is right. Go back to study more guys.

Ronald Bradford said...

Actually the Terminal title states 'mysql' so that solves the problem before even looking at the syntax.

Unknown said...

The terminal window title says "Terminal - mysql". So MySQL it is.

jjmahe said...

Great post !

Bartolomeo said...

You fucking nerds need to get laid...

Joel Garry said...

I have little confidence in policepersons being able to spell dismemberment in their reports. It's not as though they had drop down menus in 1947.

Regarding multiple homicides on a single victim, sure, why not? All you need are methods that are not instantaneous.

They should pipe the output through DECTALK, the Swedish accent would add verisimilitude.

I am not a robot. But my kid is. urytua

William Robertson said...

@Massimo
> "But the substr is faster."

That's right, "substr(v.fname ,1,1) = 'L'" is much faster than "v.fname like 'L%'". Everyone knows that.

Anonymous said...

Gotta watch that movie at last, that's one more incentive.

Anonymous said...

Guys... If the database was perfect they would also have perfect security, which would ruin the whole story.

In the real world, budgets require shit security and shit database design.

Unknown said...

I think it's super awesome that they're actually using not only a real computer OS, but an actual real computer language and terminal interface. And you're complaining that the results don't match the query?

Adrian said...

Bartholomeus is obviously not aware of the many hot and available women who hang out in the SQL bars of London, just waiting to be impressed by sardonic attacks on the misrepresentation of predicate logic in the contemporary media.

Unknown said...

@klippx:

I don't know about the rest of the world, but in Sweden the same crime might get several incident reports, and it won't be cleaned up.

For example: One row for the one witness, and another for another witness if the cops aren't sure it really is the same murder. Another report for the first parts of the body, another for the chopped up parts nearby, and yet another for the place where they dismembered the body (and therefore might actually had killed the victim) etc etc etc.

In short, there are about 100 murders in Sweden annually, and about 150 reports. If you want to be sure, about the numbers you have to remove multiple rows.

William Robertson said...

@Logan - we love them for it too. Nobody's complaining.

I look forward to seeing a lot more SQL in the movies after this.

William Robertson said...

@Robbin - please tell me you hacked into the Västra Götaland police DB for those stats.

zaph said...

strange there are so few results considering they also wanted to know whether a priest was involved in raping a 'female prostitute'; multiply that by 100 if it the homicide involved the rape of a minor....a parody, not to be taken seriously!

MissyM said...

Also, I'm fairly sure that Swedish domestic police databases would have keywords in Swedish??

inside-out boy said...

You really took the magic out of the movie.. thx

Avenger7x said...

Guys... It's MySQL, not Oracle SQL

Oliver said...

... OR ( SUBSTR(v.fname, 1, 1) = 'R' AND SUBSTR(v.lname, 1, 1) = 'L' );

"The inspector then concentrated on the names. He obtained a list of everyone in Hedestad named Mari, Magda, or Sara or who had the initials R.L. or R.J. He had a
list of 307 people. Among these, 29 actually had some connection to Harriet."

Joel Garry said...

OK, I am a robot.

William Robertson said...

Joel - why do you think you're a robot? We're here to help.

William Robertson said...

Oliver - are you writing a novel?

Joel Garry said...

I've been having difficulty reading many of the recaptcha presented by various bloggers. I don't have problems (besides carelessness) reading anything else. What do you make of the number in the "robot" link I posted?

William Robertson said...

I don't see a number. But maybe it's only visible to robots.

Joel Garry said...

Yet, the Please prove you're not a robot expects you to enter the number. Therefore, if you can't see the number, you are a robot.

I just think this is an irritating wtf. Unfortunately, until some real non-repudiable infrastructure comes along, I'll be complaining about it.

William Robertson said...

I did think about switching it over to the Voight-Kampff option, but that needs a camera to monitor pupil dilation and a lot of questions about turtles.

Ahewanko said...

I'm glad all of you geniuses are all in one place. It's like everyone saw the same movie and then talking about it scientifically. wow, I'm impressed.

Soco said...

I think I see the film William, had not thought these issues.
It looks interesting.
Best regards!

Unknown said...

lmao awesome post

mvmn said...

I don't get it - what's wrong with the JOIN? She wants to see the data from joined tables in the output after all.

I'd like to see the actual query rewritten by you, guys, to see what you think should be the correct version. Not just pointing fingers "this is wrong, that is wrong".

The "like" part is bad of course, but well, maybe "substring" + "=" will ensure case sensitivity? Not sure why would that be needed, but well, it's just a movie after all - you can forgive substring instead of like in a movie. I know you can - just think of all the other movies about hackers and computers ((-;

William Robertson said...

It was the outer joins that (surely) shocked moviegoers. If we were to rewrite it in a real SQL crimefighting scenario tomorrow we would leave out the "outer" keyword and use the time thus saved for catching criminals.

With hindsight the 'Mari, Magda' thing didn't help but that's forgivable. We probably wouldn't use green on black or uppercase keywords either but again, this is the movies and I suppose they add drama.

More importantly perhaps, this piece is intended as a joke. It is not actually a criticism of any aspect of this tense, stylish, atmospheric and only slightly voyeuristic crime thriller, least of all its database query syntax. It's not "Taken" after all. What was funniest to me was noticing my own reaction to the "AS" keyword as the text scrolled across the screen, and it is this tension between movie drama and office life that I have tried to capture.

Unknown said...

I like soup

Unknown said...

Obviously, I'm late to the party. I worked as a consultant on The Social Network and The Girl with the Dragon Tattoo, part of which work included creating this sequence. I must admit, I never hoped it would get so much attention, nor apprehended that it would be subjected to such scrutiny.

For anyone interested, you can check out the original screen capture here. Also for your reading pleasure (and I'm sure my continued embarrassment), SQL dumps for the mock Värmland and Västra Götaland police databases.

To give you a sense what we were going for, have a look at the script, Cmd-F 161AA. (I have no idea whether this copy is posted legitimately; apologies if it suddenly disappears.)

Thanks for by far the most thorough, if scathing, code review I've ever endured. I'll be coding gingerly for at least a month.

William Robertson said...

Paul - welcome, and thank you. You are a hero to Swedish SQL-based crime fighters everywhere.

Anonymous said...

Wondering how she knew that the keywords would be inserted in lowercase in the database...anyway she saved some time skipping the UPPER function

William Robertson said...

Inspired by Paul's comment and links to the actual source code (always a help in a forensic SQL reconstruction exercise) I took another look at my screenshots and noticed the third small screenshot (beginning 'ctim as v') fits on the end of the longer Photoshopped one - now updated - and explains why there were no results for victims with initials 'R L' (it's part of an 'OR' condition, along with the 'Mari'/'Magda' thing that didn't get any hits either).

Unknown said...

@oracleappsnotes: String matching in MySQL is case-insensitive by default.

Anonymous said...

@Thomas Okken
Ah! That explains it.

Shrimpley Pibbles said...

I just googled "girl with the dragon tattoo SQL" after taking the exact same screenshot as your last one. I didn't realize there was a fuller view of the query. The end of the query has an "AND" [AND SUBSTR(v.lname,1,1) = L]. So, for the whole outer AND (containing [k.keyword IN ...], a last name starting with L is a requirement! Right? None of the last names start with L. All the other parts of the WHERE clause are ANDs.

Maybe I'm missing something, but it doesn't seem like the query results could have come from this query. Is it a movie mistake, or is it me?

My impression is she should have written the last line with another set of parentheses, because wasn't she looking for "R.L."?

OR (SUBSTR(v.fname, 1, 1) = 'R' AND SUBSTR(v.lname, 1, 1) = 'L' ));

William Robertson said...

The final "AND" consists of three alternative conditions separated by "OR". The results shown must have matched on one or more of the keywords in the first "OR" condition, and the whole 'Mari', 'Magda', 'R. L.' hunch went nowhere.