The chronicle of performance tuning

Looking back over the past two to three years, it is striking how much the infrastructure here has changed through the performance tuning effort...

Let's start with shared-memory connections, the starting point of pretty much everything. I know that in a lot of companies this is still how clients connect: direct shared-memory connections. They provide good performance and a simple infrastructure, and if you can live with that, they have served you well. They also have a few disadvantages, of course. First, they mask bad coding; shared memory is forgiving, so you don't have to write even decent code to get good performance, but that doesn't mean the bad code isn't there. It always is. Second, they can't scale: you are limited to the size of the database server unless you keep adding CPU or memory, modern servers such as an AIX P795 or a Solaris box are already configured optimally at the chip level, and adding more hardware doesn't necessarily give you the performance you hope for (we will get into that). Third, you put the database in danger; when you try to disconnect a user that holds a lock, you risk bringing the database down. Fourth, it makes debugging hard; with thousands of users, a dozen long-running jobs, and a few dozen cron jobs all running on the database server, you can imagine how hard it is to figure out which one is the culprit. Fifth, it is hard to maintain, and it makes the DBA's job really hard when it comes to maintenance windows. So it was time to move the users out: the CPU was up to its neck, and the hope was that moving the users off the box would alleviate the pain.
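
For anyone who hasn't had to write both styles, here is a minimal sketch of the two connection flavors. The database name, host, and service name are made up for illustration; the real ones don't matter.

```
/* Hedged sketch: the two connection styles. Database "sports", host
   "dbhost01", and service "sportssrv" are illustrative names only. */

/* Self-service (shared-memory) connection: the client must run on the
   database host and attaches directly to the shared-memory segment. */
CONNECT VALUE("sports -ld spLocal").

/* Client/server connection from a remote frontend box: every record
   travels over the network through a server process. */
CONNECT VALUE("sports -ld spRemote -H dbhost01 -S sportssrv -N TCP").
```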

Moving the users out has great advantages, one being scalability: you can now just add more frontend servers, such as Linux boxes with plenty of CPU power and memory, to handle more users. The trade-off is that the "good performance" you had with shared memory is now exposed for what it was: inefficient code. When I dug in and fixed one critical program by using temp-table caching, the response time dropped from 49 seconds to under 2 seconds, which was acceptable to the users. On the other hand, how did the CPU look? Still not good. CPU usage had hardly dropped at all, and vmstat showed the server now spending more CPU in system time than in user time, which is not what I wanted. The problem was in the network. tcpdump is a good tool to show exactly what is going on: how many network messages are going to and from a specific server, and how big each message is. In my case there were dozens of packets (if I remember correctly), each less than 100 bytes. Further research and digging provided more insight, and the numbers fit: vmstat showed more context switches, and the chatty network traffic produced a lot of interrupts for the system to handle. While the CPU is working through the tasks in the run queue, it has to keep stopping to service the network interrupts.
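
For context, the temp-table caching fix was nothing exotic. Here is a minimal sketch of the idea; the table and field names (rateTable, rateCode, rateValue) are made up, but the pattern is the same: read the reference table once into a temp-table, then do every subsequent lookup locally instead of across the network.

```
/* Hedged sketch of temp-table caching. "rateTable" and its fields are
   hypothetical stand-ins for the real reference table. */
DEFINE TEMP-TABLE ttRate NO-UNDO
    FIELD rateCode  AS CHARACTER
    FIELD rateValue AS DECIMAL
    INDEX idxCode IS PRIMARY UNIQUE rateCode.

/* One pass over the database table instead of thousands of tiny
   client/server round trips. */
FOR EACH rateTable NO-LOCK:
    CREATE ttRate.
    ASSIGN ttRate.rateCode  = rateTable.rateCode
           ttRate.rateValue = rateTable.rateValue.
END.

/* Later lookups hit the local temp-table, not the remote database. */
FIND FIRST ttRate WHERE ttRate.rateCode = "STD" NO-ERROR.
IF AVAILABLE ttRate THEN
    DISPLAY ttRate.rateValue.
```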

So now what? That's when 10.2B06 comes in. Working with Progress, we got a few knobs to turn, the most important one being LRU skip. I am not going to go into what LRU is; there is a separate blog post for that. By adopting this feature, LRU latch activity dropped from 200,000-300,000 locks per second to under 100,000 per second, a very significant drop, but not enough to handle two to three times the volume. Furthermore, there are a few other network-related features in this release, and they are good to have for sure. For one thing, with one of those knobs, -Mm finally means something.
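
If you want to watch the latch numbers yourself rather than trust a promon screenshot, something along these lines against the _Latch virtual system table can give a locks-per-second figure. This is a rough sketch; the exact latch name string and field availability can vary between releases, so treat the MATCHES pattern as an assumption.

```
/* Rough sketch: sample LRU latch lock counts from the _Latch VST twice
   and turn the delta into locks per second. The "*LRU*" name pattern is
   an assumption and may need adjusting for your release. */
DEFINE VARIABLE iBefore   AS INT64   NO-UNDO.
DEFINE VARIABLE iAfter    AS INT64   NO-UNDO.
DEFINE VARIABLE iInterval AS INTEGER NO-UNDO INITIAL 10.

FIND FIRST _Latch WHERE _Latch._Latch-Name MATCHES "*LRU*" NO-LOCK NO-ERROR.
IF AVAILABLE _Latch THEN DO:
    iBefore = _Latch._Latch-Lock.
    PAUSE iInterval NO-MESSAGE.
    /* Re-find to refresh the VST record after the sample interval. */
    FIND FIRST _Latch WHERE _Latch._Latch-Name MATCHES "*LRU*" NO-LOCK NO-ERROR.
    iAfter = _Latch._Latch-Lock.
    DISPLAY (iAfter - iBefore) / iInterval LABEL "LRU latch locks/sec".
END.
```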

So now we have one database server and a few frontend servers. We also moved the other long-running jobs out to their own server, a dedicated place for the long-running processes and cron jobs. There are more server names to remember, but the structure is a lot clearer. By utilizing the AppServer on the DR side, even processes that need to update the database don't have to stay on the source side.
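
As a sketch of what that looks like from the client side (the AppServer service name, host, port, and program here are all made up):

```
/* Hedged sketch: running a read-heavy routine on an AppServer that sits
   next to the DR database. "asDR", "drhost01", port 5162, and the
   program name are illustrative only. */
DEFINE VARIABLE hServer AS HANDLE  NO-UNDO.
DEFINE VARIABLE lOk     AS LOGICAL NO-UNDO.

CREATE SERVER hServer.
lOk = hServer:CONNECT("-AppService asDR -H drhost01 -S 5162") NO-ERROR.
IF lOk THEN DO:
    RUN reports/heavy-read-report.p ON SERVER hServer.
    hServer:DISCONNECT().
END.
DELETE OBJECT hServer.
```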

To handle more volume, while Progress and IBM are figuring things out from the database and OS side, the business is growing faster than the possible solutions. The application is predominantly reads, so from the DBA side the way to alleviate the system load was to split the database. I chose a couple dozen tables that account for about 40% of the whole database's reads and moved them to a different AIX LPAR, and now the CPU is at about half of what it used to be, which leaves more room to grow.
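
Picking which tables to move was mostly a matter of looking at read statistics. A rough sketch of the kind of query I mean, using the _TableStat and _File VSTs (this assumes -tablerangesize is set high enough that the tables you care about are actually being counted):

```
/* Rough sketch: rank tables by read activity to decide what to split out.
   Assumes -tablerangesize covers the application tables. */
DEFINE VARIABLE dTotal AS DECIMAL NO-UNDO.

FOR EACH _TableStat NO-LOCK:
    dTotal = dTotal + _TableStat._TableStat-Read.
END.

FOR EACH _TableStat NO-LOCK BY _TableStat._TableStat-Read DESCENDING:
    FIND FIRST _File WHERE _File._File-Number = _TableStat._TableStat-Id
        NO-LOCK NO-ERROR.
    IF AVAILABLE _File THEN
        DISPLAY _File._File-Name FORMAT "x(24)"
                _TableStat._TableStat-Read LABEL "Reads"
                100 * _TableStat._TableStat-Read / MAXIMUM(dTotal, 1)
                    LABEL "% of reads" FORMAT ">>9.99"
            WITH WIDTH 90.
END.
```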

So that's 3 source databases and 2 DR databases, plus a handful of frontend servers and a server for the long-running jobs.

The next step of tuning would be another split, and hopefully the system CPU will drop once IBM and Progress work out the remaining issues.