Abstract
Many enterprise IT environments run databases (DBs) over network-attached storage infrastructures (e.g., Storage Area Networks, or SANs). The common practice in these settings is to have two separate teams of administrators, a DB team and a SAN team, manage the respective subsystems. Each team knows its own subsystem and uses tools specific to it, which makes it particularly hard to diagnose performance problems whose effects propagate across subsystems. We are developing a tool, called DIADS, that works in an integrated fashion across the DB and the SAN to diagnose the cause of query slowdowns, where the performance of a repeatedly run DB query deteriorates over time, based on past and recent system monitoring data. This talk will describe two new features that we have added to DIADS.
Given a query slowdown to be diagnosed, DIADS uses machine-learning techniques to identify relevant symptoms of the slowdown. Pinpointing the root cause from the detected symptoms invariably requires expert knowledge about the DB and SAN. We have developed a declarative framework that makes the specification, usage, and maintenance of this expert knowledge for root-cause diagnosis both easy and efficient. We will describe our current implementation and evaluation of this framework, and also discuss future enhancements.
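To make the idea of declaratively specified diagnosis knowledge concrete, here is a minimal sketch in Python. It is not the DIADS framework: the rule format, the symptom and cause names, the weights, and the scoring function are all hypothetical, chosen only to show how expert knowledge can be kept as data, separate from the code that evaluates it.

```python
from dataclasses import dataclass


@dataclass
class CauseRule:
    """One declarative entry: a candidate root cause and the symptoms that point to it."""
    cause: str
    expected_symptoms: dict  # symptom name -> weight (illustrative values)


def score_causes(rules, observed):
    """Return (cause, confidence) pairs, highest confidence first.

    Confidence is the weighted fraction of a rule's expected symptoms
    that were actually observed. This scoring scheme is an assumption.
    """
    scored = []
    for rule in rules:
        total = sum(rule.expected_symptoms.values())
        matched = sum(w for s, w in rule.expected_symptoms.items() if s in observed)
        scored.append((rule.cause, matched / total if total else 0.0))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical knowledge base; an administrator would edit this, not the code above.
RULES = [
    CauseRule("san_volume_contention",
              {"high_volume_latency": 2.0, "slow_table_scan": 1.0}),
    CauseRule("stale_db_statistics",
              {"plan_changed": 2.0, "row_estimate_error": 2.0}),
]

observed = {"high_volume_latency", "slow_table_scan"}
ranking = score_causes(RULES, observed)
```

Because the rules are plain data, adding or revising a root cause in this sketch means editing one entry in `RULES` and leaves the evaluation logic untouched.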
The DB query optimizer, which generates the execution plan for each query, chooses a plan based on data statistics, system settings, and workload properties. A change to any of these inputs can trigger a change in the execution plan chosen for a repeatedly run query. If DIADS notices a change of execution plan that correlates with a query slowdown, it has to determine whether the plan change is a root cause of the slowdown. The talk will also describe our progress towards quantifying the impact of an observed change of execution plan on a query slowdown to be diagnosed.
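One simple way to think about quantifying the impact of a plan change is sketched below in Python. This is an illustrative assumption and not the method used in DIADS: it supposes that running times are available for the old plan both before the slowdown and under current conditions (for example, by forcing the old plan), and it attributes to the plan change the part of the slowdown that the old plan would have avoided. The function name and the sample numbers are hypothetical.

```python
from statistics import mean


def plan_change_share(baseline_old, recent_old, recent_new):
    """Estimate the fraction of a slowdown attributable to a plan change.

    baseline_old: running times of the old plan before the slowdown
    recent_old:   running times of the old plan under current conditions
                  (assumed obtainable, e.g. by forcing the old plan)
    recent_new:   running times of the new plan under current conditions
    Returns a value clamped to [0, 1].
    """
    slowdown = mean(recent_new) - mean(baseline_old)
    if slowdown <= 0:
        return 0.0  # no slowdown to explain
    due_to_plan = mean(recent_new) - mean(recent_old)
    return max(0.0, min(1.0, due_to_plan / slowdown))


# Hypothetical running times in seconds.
share = plan_change_share(
    baseline_old=[10.0, 10.0],
    recent_old=[12.0, 12.0],
    recent_new=[20.0, 20.0],
)
```

With these sample numbers the total slowdown is 10 seconds, of which 8 seconds disappear when the old plan is used, so the remainder would have to be explained by other causes, such as changes in the SAN.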