Wednesday, January 17, 2007

Performance Tuning and The Big O

Oracle Performance Tuning is a big subject, as anyone will appreciate who has read any of a number of books that set out to help you to understand what problem you are trying to solve, what factors may affect the performance you are seeing, what strategies are available to the query optimizer, and so on.

While Cary Millsap's Optimizing Oracle Performance focusses on finding, tracing and prioritising specific problems in the face of vague reports that the system seems a bit slow this week, and Jonathan Lewis' Cost-Based Oracle Fundamentals takes us on a tour of the CBO to help answer such questions as Why isn't my EXISTS query using an index? (and why isn't it faster than the IN version?) a poster on OraFAQ has an approach we've not seen before:

Hi,

The following is a problem I need help with and I am willing to pay for help if necessary. Any info would be greatly appreciated.

Two tables in a database:

Table1 contains a list of phone numbers
Table2 contains a list of phone numbers as well

I would like to create a Table3, in which Table3 contains all numbers from Table1 that is not in Table2. I am looking for the shortest runtime possible, keeping in mind that you can use whatever method(s) you deem necessary.

Table1 contains 30 Million rows,
Table2 contains 2000 rows.

given a regular SQL expression, it will yield Big O(m*n)

Where m = rowcount of Table1
and n = rowcount of Table2

Generate for me, a method in which, runtime will yield Big O (m log2 n).

I don't need code, I want to hear your logic. Table1 is customers, Table2 is a list of prepaid phone numbers. Table3 is list of people to bill.

As usual no database version is given. The first suggestion, as you might expect, is the quite reasonable:

select phone_no from table1
minus
select phone_no from table2;

accompanied by a comment that it doesn't seem like a great piece of schema design to have two tables in the first place. However, the reply comes back:

The guy who wrote this problem just told me that this answer is incorrect.

His response to me was:

it's not too simplistic, but it is incorrect. This will still yield a big O(m*n).

Give it one more try, you are thinking too much in terms of DB.

Ask yourself, what are the only structures that would yield BigO(nlog2n)? Answer that, and you will get your answer.

Any ideas?

Resident mathematician Ross Leishman tried to explain to me what Big O (m log2n) means, and I can confirm that it is not after all a When Harry Met Sally reference as most of us would probably assume. Apparently the version with an EXISTS subquery was what was wanted, which seemed odd to me on a number of levels, not least that an IN subquery would probably produce the same plan, especially in 10g where the new HASH JOIN RIGHT ANTI allows the database to build its hash table from the 2,000-row table2 rather than the 30 million-row table1. But of course we don't know the database version, do we? Or whether the columns are nullable, unique or indexed, or how values are distributed, or really anything about the actual environment that would help in solving a real-world performance problem. I can see where I'm going wrong though. I am thinking too much in terms of DB.

5 comments:

Anonymous said...

It's clearly a homework assignment, so why bother answering.

If you where to bother anyway, there is really no point in the question, since it assumes that the database will use a specific algorithm (or class of algorithms) given even a so simple query. But the database doesn't.

Anyway, if we know the constant cardinality of both tables, almost any algorithm will be O(1) (ie. constant time), which is good but useless information in this case.

William Robertson said...

I thought that too, but apparently it is semi-real. When I asked, the poster said:

"It IS a real world problem. The person who asked me to solve this worked for a telephone company and this was an actual problem he had to solve."

Anonymous said...

One has to check for each record of table 2 whether or not it is in table 1.
For an array of numbers (of m numbers) that are sorted, finding out whether another number is in that array or not, the best algorithm uses o(log2(m)) time. (without using another index or anything.)

The algorithm works as following: look at the middle of the array. If the number to search for is bigger than than the number in the array, then repeat this process with the upper half of the array, otherwise with the lower half.

This gives us the required 0(n*log2(m)) limitation.

Ross Leishman said...

Raymond is absolutely correct, a binary search would yield a Log2(n) cost for each iteration.

The problem is that the original poster was given the problem by some pompous git who mistakenly thinks he knows a bit about Oracle.

Since Oracle does not (to my knowlege) employ a binary search, I leapt to the (apparently correct) assumption that the poster's mentor thinks b-Tree indexes are Binary-Tree, not Balanced-Tree.

The post was a tangled web of lies, half-truths, and unfounded assumptions. But for all that it had a kernal of an interesting problem that - if expressed properly - would have made for an interesting post rather than a painful one.

Unknown said...

actually, you can get roughly O(m+n), which is better than O(m*log(n)) by using a bucket system, or a hashing system with a sufficiently good hashing algorithm.