It was Monday the 1st of October 2012 and it was Alberto’s turn to do the early shift at the office. He is a Business Analyst and needed to arrive at 0700 in order to run a few Trade reports for his boss. It was 0630 as he started his car journey from his home in Mamer to Kirchberg. It normally took 15 minutes.
“Schaiss !!” cursed Alberto “I should have used the Waze* application !”
As he joined the A6, traffic was already at a standstill. Alberto’s mood darkened further when he heard on the Radio.
|“ ……. un bouchon de près de 12 km sur l’A6 dans le sens Belgique-Luxembourg …”|
There had in fact been an accident on the A6, nothing serious but enough to cause chaos on the road.
Instead of arriving at his office at 0700 he arrived glumly at 0800. A few of his colleagues had already arrived and were not looking happy either.
“What’s the matter guys ?”, asked Alberto. “Massive traffic ja…”
“The new GUI trade reporting screen is very slow, sub second queries are now taking over 5 minutes !”, interrupted a colleague.
The Trade Reporting screen was a new Java application, connected to an Oracle Database, that was used across the bank to monitor stock trades. It had been used for 6 months now and its migration was quite a success with users very happy with the performance of the screen.
Meanwhile on the other side of the building the production Database Administrators (DBAs) were starting to have a ‘heated’ discussion about the slowing database.
(For the non-DBA literate please feel free to skip this conversation in italics, you will still understand the story. This conversation should be clearer after the rest of the article is read)
“We have multiple full table scans running on the Stock Trade table. There is one year’s worth of stock trades in that table, around 30 Million rows. The Oracle execution plan has also changed since Friday. ” a DBA said.
“Code looks ok, the developers have even started using bind variables “, another replied.
“Tom Kyte would have been proud”, retorted another.
“Was the machine rebooted this weekend?”
“Who loaded the last execution plan ?”
“Richard did … at 0739 today“
Richard was also in the DBA team and had been the first to arrive that morning.
“Yes, the Crontab was not restarted this weekend and there was a quarterly report not run. I ran it this morning before the users noticed. The report ran in half the time it did last quarter”, he said laughingly.
“Well it seems that the execution plan of this report is shared with the GUI querying application. Each time a GUI user runs a query we have a full table scan of the Stock Trade table”
“Ah ! I had no idea that they would share the same plan between a batch and a GUI process … doesn’t really make much sense!” ..
In fact Richard, a Database Administrator for the bank, had to run a quarterly report manually on Monday Morning (at 0739) because the scheduling (Cron) system was not working after the server was rebooted. He was living in Kirchberg and walked to work, not having to contend with any traffic issues.
But had Alberto arrived at 0700, and ran his GUI query before Richard, then no issue would have occurred. He would have ‘saved the day’ without anybody realising and the GUI application would have run as fast as in the weeks before.
So it seemed the winner of the ‘race’ between Richard and Alberto had the more relaxing day. It really is the early bird that catches the worm!
But how can a mission critical application be so sensitive to event order ? A reminder of the key events which caused this production issue :
- Before the weekend, the GUI query had always been run before the Cron quarterly report
- The system was rebooted (and the database was restarted) at the weekend.
- After this latest restart, because the Cron was not restarted, Richard ‘won’ the race by running the quarterly report manually at 0739 – before the GUI query was run.
Alberto’s lateness had in fact highlighted a design frailty in the application that was never spotted in the testing phase. But I suppose nobody would have thanked him for that fact!
What happened ?
The GUI queries and the quarterly report shared the same SQL statement.
SELECT * FROM Stock_trades Where trade_date between :trade_start_date AND :trade_end_date;
After a reboot, when Alberto’s GUI query was run before Richard’s quarterly job then the GUI was fast (less than 1 second) while the quarterly job took much longer. But on the morning of the A6 traffic jam, Richard ran his quarterly job first (before the GUI was run): the quarterly job was faster than before, but the GUI took over 5 minutes. So, whichever of the two jobs runs first is as fast as possible (good) but whichever runs second is much slower than it should be (bad !).
But why ?
To understand why – and (more important) to know how to fix this problem so that both jobs run as fast as possible every time – a bit of background knowledge is needed: not too technical, so please don’t give up reading at this point !
(For the more technical reader, there are some technical notes below written in italics like this. But if you don’t want to hurt your brain too much, please feel free to skip over these extra notes; you’ll still get a clear understanding of what’s going on !)
In every SQL SELECT statement, we have to tell the database exactly what data we want; in this stock trades example, we want all the records within a range of dates. But we don’t tell the database how to find those records; Oracle works that out for itself.
It’s a bit like getting into a taxi: you tell the driver exactly where you want to go, but normally you don’t tell the driver how to get there; the driver decides which route to take.
Inside every database, there’s a piece of decision-making code called the Optimizer, which decides how to find the wanted records. So, the Optimizer decides the route, just like the taxi driver does. The chosen route is called the “Execution Plan”, or just the “Plan” for short.
The Optimizer looks at all the possible plans (routes) and chooses the route which it predicts will be fastest. The problem is: which plan is actually fastest depends on how many records we want from the table.
There are really only two ways to find the records we want within a table: either use an index, or use a full table scan (FTS). An index reads only the records which we actually want, but a FTS reads all the records in the table (all 30 million in this case) and then ignores the ones we don’t want.
An index is faster if we want only a small fraction of the records in the table; but FTS is faster if we want a bigger fraction of the records. Why is this ?
Consider how we use a reference book. If we want to look up some facts we would check the index at the back to find the page numbers that contain these facts. Then, armed with the page numbers, we would go straight to those pages.. Using this book index is very efficient for locating a few pages very quickly.
But what if there are many references to the facts we want ? What if the book has 500 pages, and the set of facts we want are on, say, 250 of the pages ? It would be quicker (less slow) to read every page in the book – a “full book scan” than have to keep going to the index and back again every time we want to find the next reference.
[Reading rows from a table block involves two physical operations: a disk I/O to read the block into memory (if not already there), followed by CPU time to read the row(s) we want from the in-memory copy of the block. Typically the disk I/O takes longer, because disk is usually slower hardware than CPU. So to speed up the execution, our main concern is to minimize the number of disk I/Os, which means minimizing the number of blocks we have to visit.
Each table block can contain many rows; our block size was 8 Kbytes and the Stock Trade table had an average row length of 200 bytes, so we had around 40 rows per block. So the 30 million table rows filled about 750.000 blocks.
So for our quarterly report (wanting 25% of the rows), if all the rows for one quarter are stored together- a high “clustering factor” – they will fill 25% of the blocks, so only this 25% of the blocks would need to be accessed; but if the rows for this quarter are spread out (mixed up with other rows) then every block will contain about 10 of the rows we want (and about 30 rows for other quarters) – so then every block would need to be accessed. In this case, clearly a FTS (reading every table block once) is faster than using an Index (still reading every table block because that’s where our quarter’s data is, and reading index blocks as well).
But if our quarter’s rows fill only 25% of the table blocks, surely it’s faster to use an Index (visiting only 25% of the table blocks plus a few index blocks) than a FTS (visiting 100% of the table blocks) ? Well, no.
What’s important here is not how many blocks do we have to visit, but how many disk I/Os must we do to get these blocks into memory ? When using an index Oracle reads only one block per I/O, but when doing FTS, Oracle reads many blocks in a single I/O – a “multiblock read”. And for our 750.000 blocks, bigger I/Os clearly means fewer I/Os, and is therefore faster.
(Of course, if some of the blocks are already in memory then no I/O is needed at all; and this is more likely when using an index.)
So as the fraction of rows we want increases, it becomes more likely that a FTS will be faster than an Index. But it is clear that the exact “tipping point” depends on several factors. As a very rough guideline, FTS will usually be faster than Index if we want more than about 5% of the rows. ]
So what does this have to do with our problem query on the Stock_trades table ?
Remember that this table stores a full year’s worth of records. Alberto’s GUI report needed only a very small fraction of these (a date range covering only a few hours); but Richard’s quarterly report needed a whole three months of rows. So for the fastest performance, the GUI report should use an index but the Quarterly report should use a FTS.
Back in 2012, the database was running Oracle version 10g. And at 10g, the Optimizer chose a plan only the first time that the query was run after a restart; from the second execution onwards, the same plan was reused again and again: “one plan fits all”. If the GUI query was run first, the Optimizer decided to use the index, which is what we want for the GUI query but not for the quarterly report. On the other hand, if the quarterly report was run first, the Optimizer chose a FTS, which is fine for the quarterly report but not for the GUI query. We do not want “one plan fits all” !
[This happens because Oracle uses bind variable peeking (BVP) to get the actual bind variable values from the application before the Optimizer decides the plan. Using these values, the Optimizer estimates what fraction of the table records the query wants, and can then choose the best plan for those specific values. This is fine, but at 10g and earlier this “peek” was done only the first time the code was run; all later runs reuse the same plan. There was an inbuilt assumption that all runs will want about the same fraction of the records, and therefore “one plan fits all” is fine.]
Here’s what had been happening before, when Alberto’s GUI query was run first:
But here is what happened when Richard got into work early and ran his Q3 report first:
So it’s not surprising that the GUI query was so much slower ! Using FTS, it was reading all 30 million rows when the GUI wanted only a few of them.
So what can be done to fix this problem and prevent it from happening again in the future ?
1. Using Literal values in the code
Our original query used bind variables:
SELECT * FROM Stock_trades Where trade_date between :trade_start_date AND :trade_end_date
The actual values for :trade_start_date and :trade_end_date are passed to Oracle by the calling program at the beginning of execution.
But when we use literal values we put the “actual value” into the SQL statement itself. So an example of a GUI query run would be :
SELECT * FROM Stock_trades Where trade_date between TO_DATE('01-OCT-2012 00:00’,’DD-MON-YYYY HH24:MI’) and TO_DATE('01-OCT-2012 09:00’,’DD-MON-YYYY HH24:MI’)
The advantage of using literals is that the queries are optimized for each new set of actual values. This gives an execution plan that is relevant to each individual query. Had our application used literal values we would have avoided this particular production issue. The GUI would have used an index and the Quarterly job a full table scan. Perfect !
[But the downside of using literals is that the optimization itself takes time; this time is mainly CPU not disk I/O, so it will take far less time than the execution. Even so, after many optimizations the library cache (the Oracle memory where the plans are stored) could be flooded; and only one user at a time can load a new plan, so when several users are running our GUI query at the same time, they may have to wait for the “shared pool latch”, which is a permission to load, further slowing them down. But all the GUI queries want the same plan (index), so it would be best to leave the GUI code as bind variables and change the quarterly report code to use literals.]
That’s the upside but let’s explore the downside. What if we can’t change the code ? This is often difficult or impossible in a production environment. Then (at 10g and earlier) we’re stuck with “one plan fits all”.
2. Creating an AFTER STARTUP trigger
If we must have only one plan, the index plan is better because the GUI queries will execute in 1 second or less, not 5 minutes ! Yes, the quarterly report will be slower using an index, but at least that runs only once per quarter while the GUI query runs many times every day.
So can we ensure that at least we get the index plan every time, not the FTS ? Well, maybe. We could create an AFTER STARTUP trigger in the database: on every restart (before any user queries are run) , the trigger code would automatically execute the GUI query once (to get the index plan) and throw away the query results. So then we’d get the index plan every time.
[However, it’s possible that the GUI index plan could later be aged out of library cache memory due to space pressure.]
3. Turn off bind variable peeking
Or, we could set a system or session parameter to disable bind variable peeking:
ALTER SESSION/SYSTEM SET _OPTIM_PEEK_USER_BINDS=false;
But then, the Optimizer will choose the plan with no knowledge at all of the bind variable values, so we cannot be sure that it will choose the index !
[The column statistics available to the Optimizer will tell it the lowest and highest values which exist in the trade_date column, so it will know that the dates cover exactly one year. But because the query says “… BETWEEN …”, it still can’t work out what fraction of the rows we want. For our GUI query it will probably choose the index, but careful testing would be needed to check this]
4. Range-partition the table with one partition per day
Partitioning means physically dividing the table into smaller sections or chunks, rather like dividing a large book into multiple chapters. Each section is called a partition, and we provide a strict rule to Oracle which states exactly which subset of the table’s records will be stored in each partition. The Optimizer knows this rule, and will choose a plan which reads only the partition(s) which contain the records we want.
This is called Partition Pruning, and is very useful in full table scans (FTS). For our STOCK_TRADES table, we could have 365 partitions, one for each day of the year; each partition would contain exactly one day’s data. This is like a huge reference book with 365 sections; if you know that all the information you need is in only one of the sections – and you know which section it’s in – why read the whole book ? You’d read only the one section which contains the information.
Then, our quarterly report would scan only three month’s worth of partitions (about 90 out of 365), and would therefore be about 4 times faster than a scan of the whole table. It’s still “one plan fits all”, so our GUI query wouldn’t use the index anymore; but it would scan only one partition, because the GUI queries only report on a small time range within one day. That’s still reading 30 million / 365 = about 80,000 rows, but would reduce the GUI’s run time from Alberto’s 5+ minutes down to about 300 seconds / 365 = less than 1 second. Great !
This looks too good to be true, so what’s the possible catch ? Money. To use partitioning legally in Oracle, you need an extra Oracle database licence called the Partitioning Option. If you have this extra licence, fine, go ahead and partition big tables as much as you want. But if you haven’t, and you want to stay legal … I don’t need to spell it out for you !
[Partition pruning will work even though our code uses bind variables; with no knowledge of which date range we want, the Optimizer can’t decide upfront exactly which partitions to scan, but it knows that only a subset of partitions will be relevant. The actual decision on which partition(s) are needed is taken later, at the beginning of the execution (at which time Oracle will know exactly the bind variable values). This is called “run-time pruning”; the plan will include this line:
“PARTITION RANGE ITERATOR PARTITION: KEY KEY……” ]
Beyond Oracle 10g
All the above solutions can solve Alberto’s timing problem, that the chosen plan depends on which query (GUI or quarterly report) is run first after a reboot. So far, so good. But (apart from changing the code), all the possible solutions so far still use “one plan fits all”. But at 11g this changes:
Adaptive cursor sharing
Adaptive cursor sharing (ACS) was introduced in Oracle 11.1 and was seen as the solution to the bind variable “one plan fits all” problem, allowing multiple execution plans for one SQL statement. The optimizer recalculates what fraction of the records are wanted every time the query is run, and can choose a new plan if it would be faster.
[With ACS, bind variable peeking is done every time the query is sent to Oracle, not just the first time. So the Optimizer knows the bind variable values every time, and can calculate what fraction of the records we want every time. If the fraction wanted this time is significantly different from previous runs, a new plan is used. In our bank’s case this would still produce only two plans (index for the GUI and FTS for the quarterly report) because all the GUI queries want a similar small fraction of the records. So no problem with flooding the library cache or shared pool latch contention.]
So for our stock_trades queries, we would get the plans we want: index for the GUI and FTS for the quarterly report. But again, there’s a catch. In fact, two catches:
- It can take several executions of a query before the optimizer “learns” that it wants a very different fraction of the records than before, so the first few executions after a restart can still use the wrong plan.
- The bank’s GUI and quarterly reports coded the SQL SELECT statement inside PL/SQL blocks. There is a known bug which disables ACS when calling from inside PL/SQL:
bug number 8357294: ADAPTIVE CURSOR SHARING DOESN'T WORK FOR SQL FROM PL/SQL
This bug still exists even at 12c and 18c.
SQL Plan Baselines
Let’s assume that the bank’s immediate problem has been fixed using one of the solutions explained above, so now they have the plans they want. Great ! But what about the future ? We have seen that the Optimizer’s choice of plan is very dependent on the data in the table, so if the data changes significantly in the future, the Optimizer’s plans could change – and be worse. For example, what if the STOCK_TRADES table grows over time to store several years’ worth of data ?
Once we’ve got the plans we want, we don’t want to lose them in the future. We want “once a good plan, always a good plan”. Starting at 11.1, Oracle provide a mechanism to do this called a SQL Plan Baseline. In a baseline the good plan(s) are stored permanently, and the Optimizer can only choose a baseline plan. And what if an even better plan is possible in the future ? The baseline can store it for Optimizer use, but only after testing has proved that it’s faster than the existing plan. In short, plans can get better over time, but can never get worse.
If you’re really clever, you can even store your own chosen plan in the baseline although the Optimizer has never chosen it before. But this is quite tricky to do, and it does need a calm and clear head !
If the developers had written the right code in the first place (bind variables for the GUI, literals for the quarterly report), the problems we’ve discussed in this blog would never have happened.
Just as driverless cars are seen as the future, Oracle shares a vision of automated databases. But as seen in the blog it seems that Oracle Developers won’t be able to put their feet up anytime soon. As the Oracle database becomes more complex and moves towards this ‘Utopia’, Developers will be faced with more complex issues and will be in need of training as much, if not more, than before.
* Yes the Waze app did exist in 2012 😉