Measuring Slow Network Time Between Application Server and Database
Recently I was implementing the ADF Performance Monitor at a customer site and doing a detailed performance analysis. This customer had severe (and less severe) performance problems but was unable to pinpoint the exact pain points. Frequently, end-users were complaining because they were experiencing first a very good performance (response times less than a second), and then suddenly experiencing a drop in performance for a certain period of time (an hour or so) with response times between five to twenty seconds.
In this particular case, at the hour overview of the ADF Performance Monitor (from 14:00 to 15:00) we can see the sudden start – and end of this drop in performance. In the top right section HTTP response times can be viewed over the selected time range. This graph makes visible when the load is high (and how high), how the response times are distributed over the categories (very slow, slow and normal) and when there are performance problems (more red and yellow colored parts of the bars).Visible is that from 14:10 to 14:40 the response times have increased significantly. This should be a trigger to drill down to this period in the monitor – for further analysis in order to find the root cause.
We can drill down from an hour to a 5 minute range (14:25 – 14:30) by clicking on the graph in the menu. Now we see an an overview of all individual HTTP request in this 5 minute range on the top graph – with time in WebLogic (blue) and time in database (yellow). We see many long response times between 5 and 30 seconds. We see that most is consumed in WebLogic but we still don’t know where the time is spent exactly. Also we can see from that the JVM (bottom graph) appears to be ‘happy’; the JVM can easily manage the load so the JVM is not the problem in this case:
The next step is to drill down to so called ‘ADF request call stacks’ in this time range. An ADF request call stack of the ADF Performance Monitor gives visibility into which ADF method caused other methods to execute, organized by the sequence of their execution. A complete breakdown of the HTTP request is shown by actions in the ADF framework (lifecycle phases, model and ADF BC executions, ApplicationModule pooling, ViewObject query executions, time to fetch database rows into the ADF app, e.g.), with elapsed times and a view of what happened when. Call stacks are a very useful help to understand specific situations, it shows bottlenecks and where we can avoid bad ADF practices and write more efficient ADF code. The parts of the ADF Request that consume a lot of time are highlighted and indicated with an alert signal.
After investigating many ADF request callstacks, we we able to detect that the time that was needed to fetch database records by ADF ViewObjects was extremely long. For example, the time shown in this callstack (see image above) to fetch only 173 records into the ViewRowCache of this ViewObject took more than 5,3 seconds and later in the same HTTP request 6,2 seconds (!). And that while the ViewObject query time – method executeQueryForCollection() – took only around 50 milliseconds. Normally, ViewObjects should be able to fetch this number of records in easily only a few milliseconds. ADF always executes this in the following order: first the ViewObject query is executed by the executeQueryForCollection() method. Then, when this method is completed, the database rows of this ViewObject are being fetched. This action and how long this takes is shown in the callstack as ‘Fetching, creating rows for <viewobject instance name>, fetched <xxx rows> ‘.
When the time to fetch rows takes long – and the number of rows fetched database rows is still relatively low – this is a very strong hint that during this request the network between the production WebLogic server and the database seemed to be the bottleneck. This can be the case when they are very far from each other and/or the internet connection is very slow. Or that the network is already very busy.
We knew there was a firewall between the database and the Weblogic server that definitely did not help the communication between these two. But, it turned out that the firewall was not the main problem. It was the periodically database batchjobs (unfortunately executed by the operation team during daytime) that were taking very very many network resources. See the following screenshot from the database Enterprise Manager with metrics from exactly the same time period:
We can clearly see in this picture too that from 14:10 to 14:40 there is many network traffic – caused by a costly database job that was executed by the operation team on that particular moment. The solution is to not execute these batch-jobs anymore during daytime :). Now end-users do not have to suffer anymore from these jobs that could be executed in the night as well.