|
|
 |
Automated Database
Design for Large-Scale Scientific Data
|
Overview:
|
Database support for scientific data
is challenging due to
the massive data volumes and the complex and diverse query workloads in
modern
scientific applications. An effective database physical design is
critical in
supporting a large variety of SQL queries over large-scale data. Taking
advantage of detailed workload information is the key to guiding the
physical
design process towards efficient solutions: For instance, the Sloan
Digital Sky
Survey (SDSS) astronomical database contains tables with hundreds of
attributes, which can be queried in various combinations. Designing
indexes requires
detailed workload knowledge in order to identify attribute subsets that
are
important for queries and must be indexed. Besides performance,
database
physical design must satisfy additional constraints, related to the
maintenance
of large-scale data: Database structures like indexes or materialized
views are
constrained by the available resources (like disk space) and also by
the
intensity of database updates which is proportional do the number of
structures
that must be maintained.
The details of each component of the project are given below.
|
AutoPart:
Automated Schema Design Through Data Partitioning
|
Partitioning
improves performance by replacing massive database tables by smaller
ones, that are faster to access. Workload information is necessary,
however, in order to determine which table attributes must be kept in
the same table fragment so that the queries accessing them does not
suffer a performance penalty by having to combine multiple
fragments. We developed AutoPart,
an algorithm for table partitioning given a representative input query
workload, that can interface to commercial systems, similarly to
available index and materialized view design tools. We
experimented with AutoPart in the context of the SDSS database, and
demonstrated
that it can improve performance by up to an order of magnitude, without
utilizing any additional database structures. On the other hand, when
combined
with indexes, AutoPart required less storage and exhibited much better
update performance compared to an index-only design approach.
Furthermore, we demonstrated that partitioning improves performance
while minimizing the impact on the storage requirements and update
performance of the database.

Schematic Diagram of AutoPart Implementation
|
Using
AutoPart in the Physical Design of Database Caches
|
Making
multi-terabyte scientific databases publicly accessible over the
Internet is increasingly important in disciplines such as Biology and
Astronomy. However, contention at a centralized, backend database is a
major
performance bottleneck limiting the scalability of Internet-based,
database applications. Mid-tier caching reduces contention at the
backend database by
distributing database operations to the cache, increasing scalability.
To improve the performance of mid-tier caches, we propose the caching
of query prototypes, a workload-driven unit of cache replacement in
which the cache object is chosen from various classes of queries in the
workload. In existing mid-tier caching systems, the storage
organization in the cache is statically defined. Our approach adapts
cache storage to workload changes, requires no prior knowledge about
the workload, and is transparent to the application. Experiments over a
one-month, 1.4 million query Astronomy workload demonstrate up to 70%
reduction in network traffic and reduce query response time by up to a
factor of three compared to alternative units of cache replacement.

Schematic Diagram of SkyQuery System
|
Efficient
and Accurate Cost Estimation for Automated Database Design Algorithms
|
State-of-the-art database design
tools rely on the query optimizer
for comparing between physical design alternatives.
Despite providing an appropriate cost model for physical
design, query optimization is a computationally expensive
process. The significant time consumed by optimizer invocations
poses serious performance limitations for physical
design tools, causing long running times, especially for large
problem instances. In this paper we introduce the Index Usage
Model (INUM), a technique for minimizing the number
of optimizer calls performed by index selection algorithms.
INUM intelligently caches the results of a few key optimizer
invocations and reuses them for query cost estimation during
the tuning process. INUM is several orders of magnitude
faster than the query optimizer and produces nearly identical
cost estimates. Integrating INUM with index selection
tools offers significant improvements in terms of their performance
and their recommendation quality.
Design of INUM
|
People:
|
| Carnegie
Mellon
University |
John
Hopkins
University |
|
|
|
|
Publications:
|
- S.Papadomanolakis and A.Ailamaki.
AutoPart: Automated
schema design for
large scientific databases using data partitioning. In Proceedings of
the 16th International Conference on Scientific and Statistical
Database Management, 2004. (PDF)
- Stratos Papadomanolakis, Anastassia
Ailamaki,
"Workload-Driven Schema
Design for Large Scientific Databases", Bulletin of the Technical
Committee on Data Engineering, p. 21, vol. 27(4), (2004). (PS)
- S. Papadomanolakis, A. Ailamaki, T. Tu,
D. R.
O.Hallaron,
G. Heber. “Efficient Query Processing on
Unstructured Tetrahedral Meshes”.
ACM SIGMOD International Conference on Management of Data.(2006) (PDF)
- Minglong Shao, Steven W. Schlosser,
Stratos
Papadomanolakis, Jiri
Schindler, Anastassia Ailamaki, Christos Faloutsos, and Gregory R.
Ganger, "MultiMap: Preserving disk locality for multidimensional
datasets", IEEE International Conference for Data Engineering, p. 1,
vol. 1,(2007). (PDF)
- X. Wang, T. Malik, R. Burns, S.
Papadomanolakis, and
A.
Ailamaki,
"A Workload-Driven Unit of Cache Replacement for Mid-Tier Database
Caching", In Proceedings of Database Systems for Advanced Applications.
(2007). (PDF)
- S. Papadomanolakis, D. Dash, A. Ailamaki,
“Intelligent Use
of the Query Optimizer for Automated Physical Design”, VLDB
2007. (PDF)
- T. Malik, R. Burns, and N.
Chawla. "A Black-Box Approach to Query Cardinality
Estimation," Conference on Innovative Data Systems
Research (CIDR), 2007 (PDF)
- “An Integer Linear Programming
Approach to Database Physical Design”. Stratos
Papadomanolakis, Anastassia Ailamaki, ICDE Workshop on Self-Managing
Database Systems (SMDB 2007).
- "Automated Physical Design for Database
Caches",T. Malik, X. Wang, R. Burns, D. Dash and A. Ailamaki, ICDE Workshop on Self-Managing
Database Systems (SMDB 2008) (PDF)
|
Acknowledgment:
|
We would like to thank NSF for the grant supporting this project: Award # 0431008, COLLABORATIVE RESEARCH: SEI + II (AST): Bypass-Yield Caching for Large-Scale Scientific Database Workloads in the World-Wide Telescope.
|
|
|