Database System Architecture:
Fig 7 Functional architecture of the database system. Querying: users interact with the WWW Server via a WWW Client Submitting data: Users submit and maintain data through the WWW Server; additionally, users at database sites can use the local Supervisor Module to enter and maintain their data without using the Internet.
The WWW Server and Clients. All interaction between end users
(e.g. Neuroiologists, Fig 7) and the ZfishDB is via the WWW Server.
Users may use any WWW client (MOSAIC, Lynx, Netscape, etc.) on any platform,
like SUN, Silicon Graphics, IBM PC and Macintosh workstations, that can run
this client software. The WWW Server extracts user input, either
database queries or new data submissions, from the data stream transmitted by
the WWW client and passes it to the Data Retrieval and Processing
Engine. It then waits for the Data Retrieval and Processing Engine
to return an appropriate response, formatted as an HTML page. Forcing all
database access through the WWW Server achieves two important
implementation goals. First, it supports ubiquitous and uniform access; users
need only have a viable (i.e. capable of HTML form handling) WWW Client
to access the database. No special purpose software is required. Second, the
data cannot be entered or retrieved except through the WWW Server. This
enforces the common logical data model and dictionary.
We plan to use an essentially unmodified HyperText Transfer Protocol (HTTP)
server as the WWW Server for the ZfishDB. HTTP server software is
publicly available from both CERN and NCSA; commercial versions (e.g. Netscape)
are available as well. We will implement this module of the ZfishDB by writing
scripts to extract and decode user input, interact with the underlying
processing components of the ZfishDB, and return results to the client.
Challenges and Limitations. Submitting image data presents a
particularly interesting challenge given the WWW front end to the ZfishDB;
there is (currently) no support in HTML, the markup language interpreted by WWW
clients, for passing anything but text strings from a WWW client to a WWW
server. There are a number of possible solutions to this problem; to avoid
necessarily extending the HTML language and creating special-purpose WWW
clients for all possible computing platforms, we propose a low-tech solution.
The user interface allows researchers to transfer (e.g. using ftp) image files
to a special directory on the server. They then formally submit the data using
an HTML form (see Fig 4) specifying the image file name, and the interface
integrates the image and its corresponding descriptive information into the
database. We believe this will be a non-issue soon; requests to extend the HTML
standard to support a "file" input type are currently under review by the HTML
standards committee.
Another challenge arises because HTTP is a stateless protocol, based on
discrete data retrieval operations. Thus, as a query is constructed,
information is not maintained as the user moves from one HTML page to the next,
setting various constraints. In one solution, the system could embed (e.g. in a
TEXTAREA field) the state information within the HTML pages it dynamically
generates and returns to the user. This information would then be returned to
the system when the user submits the forms contained in those pages.
Alternatively, the server could create a unique "state" file for each user. The
name of this temporary file would be embedded in all transactions; after some
interval of inactivity, the file would be purged.
Another potential limitation is that the use of a single WWW server will affect
reliability and performance if the server goes down or is
overloaded by requests. The parallel processor on which the system will be developed and tested, will serve
as a backup for failures and the WWW servers at the other
database sites will provide additional access points to the system, thus sharing
the load.
Data Retrieval and Processing Engine. Queries or
data submissions passed to the system via the WWW Server are processed
by the Processing Engine (Fig 7). Processing differs according to the
user's action: a broad database query, a request for a specific record, or a
data submission.
- Query Processing: For general queries, the constraints specified by
the user are processed into an appropriate SQL database query by the SQL
Generator. This query is then passed to the Distribution Manager,
which contacts a remote site and submits the query to the databases at that
site. Importantly, this component is based on a "lazy" strategy in the
interest of efficiency; the returned values are merely a list of record
(object) identifiers rather than the complete data records. In general, we can
expect the user to refine the search several times to narrow down the set of
returned records. Only when the user requests to view particular data items,
are the requested data retrieved. The results from the site is
passed to the Results Integration module and integrated
into a single result. Record identifiers from the database are augmented to
identify site of origin, sorted by data type, and then passed to the HTML
Synthesizer for formatting into HTML output. For instance, each record
identifier could be instantiated as a hypertext link to actual data.
- Viewing a specific record: After querying, further
constraining, and requerying the database several times, the user may want to
browse the resulting data set, viewing individual records. When the user
selects a specific record, a request is sent to the Record Manager,
which parses the request into a database request and sends it to the
Distribution Manager which passes it to the appropriate database. As
before, the returned record passes through the Results Integration
module and is then formatted (e.g. Fig 4) by the HTML Synthesizer.
- Submitting New Data: The Processing Engine also
supervises submissions of new information. Data extracted from the WWW
Client data stream by the WWW Server is passed to the Input
Manager, where it is checked for authenticity by verifying the submitter
name and password. Equally important, the data are checked for correctness and
consistency; although the rigid interface provides for syntactic correctness,
semantic errors are possible and must be detected. For example, if an image is
described as both a whole-mount and a reconstruction, the submission should be rejected and the user notified of the contradiction. If everything is in order,
the submission is passed to the Distribution Manager and submitted to
one of the site databases for incorporation. The site at which the data are
actually stored is irrelevant to end-users; the Distribution Manager can
heuristically decide where to place the data based on load at
individual sites, etc.
The bulk of the Data Retrieval and Processing Engine will be implemented
in C++ by developing logical objects corresponding to the various components
(e.g. SQL Generator, HTML Synthesizer, Fig 7). The Processing
Engine acts as an interpreter, translating the inputs received from the
WWW Server into various database access requests, integrating the
results, and formatting them as HTML output. We have extensive experience
[Fickas89;Douglas92] with the programming tools and techniques required to
implement such interpreter functionality.
Challenges and Limitations. Implementing the Distribution
Manager is a key challenge. The Distribution Manager must interact
with the database components at the various sites, distributing database
queries and collecting the results. Because these interactions are primarily
file transfers between UNIX hosts, we propose to use a well-understood IP
protocol like UUCP. Alternatively we could use a custom network file server as
we have previously developed.
Supervisor Module. The Supervisor Module provides tools to
assist the site supervisor in maintaining the system. For example, the site
supervisor might need to remove erroneous data from the system or restructure the database
in response to changes in the logical data model as determined by the zebrafish
community. The Supervisor Module automates integration into the ZfishDB
of data posted to public databases like GenBank and MEDLINE. It provides
bookkeeping operations for collecting and reporting database access activity.
We will also implement an auxiliary X-Window based interface to the ZfishDB
logically equivalent to the interface provided by the WWW Server. Thus,
data submission for local researchers can be streamlined, bypassing the WWW
Server. This is particularly sensible because we propose to place sites in
the labs producing the most data.
We plan to implement most of the Supervisor Module in C++ using OSF
Motif to create the user interface. Interfacing with the Illustra database will
be straightforward because Illustra provides an API for C/C++.
Databases. The database components of the ZfishDB store, organize
and provide access to the submitted data. During initial system development,
the ZfishDB will begin with a single database at the Eugene site. We will also
use a parallel processor in the Computational Science Institute for testing the client/server WWW software and as backup for the
database. Later, other sites will be added to expand the system and to
streamline submission of data to the database. The sites are
distributed geographically, making the ZfishDB is a "distributed
database", a single database system distributed over a number of nodes. Each
site in the ZfishDB implements the identical logical data model of the Illustra
database system. The
homogenous view of the ZfishDB presented to the user is constructed by the
Distribution Manager. This architecture contributes significantly to
system robustness and performance:
- Performance and Robustness: Query optimization [Andleigh92] is
greatly simplified because related objects and tables are not distributed
across sites. This modularity of function yields similar advantages for
transaction management.
- Security Risks: Because most transactions are mediated by the WWW
Server, there are only two users the database must accommodate, the Site
Supervisor and the system process associated with the Distribution
Manager. This greatly reduces security risks.
- Site Failure and Recovery: Because each site is independently
viable, network or site failures affect the system only by making query results
temporarily unavailable from that site.
- Expansion: New sites can be added easily; each becomes an independent
storage component.
We expect system performance to be limited primarily by the bandwidth, traffic
patterns, and transfer rates of the Internet. This limitation will affect
primarily image data. We plan to keep most in-line images under 20k, allowing
users to download higher resolution images on demand.
The Zebrafish Database
Continue on to Implementation of Specific Aim 2
Return to Table of Contents