Wednesday, January 17, 2007

Performance Tuning and The Big O

Oracle Performance Tuning is a big subject, as anyone will appreciate who has read any of a number of books that set out to help you to understand what problem you are trying to solve, what factors may affect the performance you are seeing, what strategies are available to the query optimizer, and so on.

While Cary Millsap's Optimizing Oracle Performance focusses on finding, tracing and prioritising specific problems in the face of vague reports that the system seems a bit slow this week, and Jonathan Lewis' Cost-Based Oracle Fundamentals takes us on a tour of the CBO to help answer such questions as Why isn't my EXISTS query using an index? (and why isn't it faster than the IN version?) a poster on OraFAQ has an approach we've not seen before:

Hi,

The following is a problem I need help with and I am willing to pay for help if necessary. Any info would be greatly appreciated.

Two tables in a database:

Table1 contains a list of phone numbers
Table2 contains a list of phone numbers as well

I would like to create a Table3, in which Table3 contains all numbers from Table1 that is not in Table2. I am looking for the shortest runtime possible, keeping in mind that you can use whatever method(s) you deem necessary.

Table1 contains 30 Million rows,
Table2 contains 2000 rows.

given a regular SQL expression, it will yield Big O(m*n)

Where m = rowcount of Table1
and n = rowcount of Table2

Generate for me, a method in which, runtime will yield Big O (m log2 n).

I don't need code, I want to hear your logic. Table1 is customers, Table2 is a list of prepaid phone numbers. Table3 is list of people to bill.

As usual no database version is given. The first suggestion, as you might expect, is the quite reasonable:

select phone_no from table1
minus
select phone_no from table2;

accompanied by a comment that it doesn't seem like a great piece of schema design to have two tables in the first place. However, the reply comes back:

The guy who wrote this problem just told me that this answer is incorrect.

His response to me was:

it's not too simplistic, but it is incorrect. This will still yield a big O(m*n).

Give it one more try, you are thinking too much in terms of DB.

Ask yourself, what are the only structures that would yield BigO(nlog2n)? Answer that, and you will get your answer.

Any ideas?

Resident mathematician Ross Leishman tried to explain to me what Big O (m log2n) means, and I can confirm that it is not after all a When Harry Met Sally reference as most of us would probably assume. Apparently the version with an EXISTS subquery was what was wanted, which seemed odd to me on a number of levels, not least that an IN subquery would probably produce the same plan, especially in 10g where the new HASH JOIN RIGHT ANTI allows the database to build its hash table from the 2,000-row table2 rather than the 30 million-row table1. But of course we don't know the database version, do we? Or whether the columns are nullable, unique or indexed, or how values are distributed, or really anything about the actual environment that would help in solving a real-world performance problem. I can see where I'm going wrong though. I am thinking too much in terms of DB.

Saturday, January 13, 2007

Short loop

It's good to see a GOTO every now and then. However, the sender of this one (thanks!) was most impressed by the loop that goes from 1 to v_totalcntr, and indeed the fact that there is a loop and a v_totalctr variable at all, with the variable carefully set from the cursor's %ROWCOUNT, when it can only ever have one value:

DECLARE
  CURSOR cur_pricing IS
      SELECT col1, col2
      FROM   sometable;

  var_pricing cur_pricing%ROWTYPE;

BEGIN
  OPEN cur_pricing;
  FETCH cur_pricing INTO var_pricing;

  IF cur_pricing%NOTFOUND THEN
      GOTO continue;
  END IF;

  v_totalcntr := cur_pricing%ROWCOUNT;

  FOR r IN 1..v_totalcntr
  LOOP
      -- loads of stuff here
      -- but no fetch from cur_pricing
      -- not even for the one time this loop will execute :-)
  END LOOP;

  <<continue>>
  NULL;
END;

Of course you could just fetch the value you want and process it, but where would be the fun in that?

PS The person who sent this in emailed me with a point I must admit hadn't occurred to me:

Saw you posted this one - thanks. But did you pick up on what they probably thought they were doing? I think whoever wrote it thought that %rowcount would have the TOTAL number of rows that the cursor would return - so they thought they would be looping around ALL the records in the cursor. The fact that they also forgot to fetch again in the loop just adds to the problem of course.