Proposal to the Andrew W. Mellon Foundation
The Open-Systems FEDORA Repository Development Project
Introduction
The University of Virginia Library has been building digital collections
since 1992. We have amassed a large collection that includes a variety
of SGML encoded etexts, digital still images, video and audio files,
and social science and geographic data sets that are being served to
the public from a collection of independent web sites that have very
little cross-integration.
We began searching in 1998 for a digital library management system that
could effectively meet both our current and future digital content needs.
Like many other libraries, we initially sought a vertical vendor solution
that provided a complete, self-contained package for delivering and
managing all digital content needs. We investigated a number of commercial
solutions, including IBM's Digital Library Software system (later renamed
Content Manager) and SIRSI's Hyperion digital media archive system.
We started our investigation with the requirement for a digital content
repository with a wide variety of features, including scalability to
handle hundreds of millions of digital resources, flexibility to handle
the ever expanding list of digital media formats, and extensibility
to facilitate the building of customizable tools and services that can
interoperate with the repository. Our view is that such repository functionality
is the core of a digital library system, providing a means of uniquely
identifying each piece of digital content as well as identifying groups
of related content or collections. The remaining services and functionality
of a digital library system would then be built on top of this core.
Our investigations revealed a number of shortcomings in commercial
digital library products:
- Most products are narrowly focused on specific media formats: they
offer good solutions for managing and delivering video or images but
lack adequate tools and support for structured (i.e., XML or SGML)
electronic texts or the ability to intermingle media types.
- Many products perform well at document management but offer no features
for dealing with video or images. None of the products we examined
adequately addressed the need to track and manage the array of ancillary
programs and scripts that play an essential role in the delivery of
that digital content.
- Many products fail to effectively deal with the complex interrelationships
amongst digital content. As an example, consider an electronic text
in the form of a five hundred-page book. The book consists of a single
file containing all five hundred pages of text marked-up using XML.
In addition to the XML file, there are also five hundred images that
represent the scanned pages from the original hardcopy edition of
the book. There are also twenty-five audio files that provide a recording
of the book's content read aloud. To the librarian, all of these digital
media comprise the intellectual object known as the "book"
and all are closely related to one another.
- Finally, we found that few of the products attend to the critical
issue of interoperability, failing to provide an open interface to
allow sharing services and content with systems from other vendors
at other libraries.
Based on these investigations we decided to embark on an in-house development
effort. Modularity and use of open-system standards are fundamental to
our design strategy. Such modularity is essential for future evolution
through component replacement. We are convinced that an object-oriented
design is most appropriate, allowing us maximum flexibility, scalability
and, eventually, interoperability with other repositories. We are also
convinced that the Library should be providing tools to our users to
give them sophisticated access to our collections and to help them manage
their own collections.
In the summer of 1999, early in our design process, we discovered a
paper about the Flexible Extensible Digital Object Repository Architecture
(FEDORA) written by Carl Lagoze and Sandra Payette at Cornell's Digital
Library Research Group, describing the architecture that they had designed.
FEDORA is a modular architecture built on the principle that interoperability
and extensibility are best achieved by the clean separation of data,
interfaces, and mechanisms (i.e., executable programs). A FEDORA Repository
provides a general-purpose management layer for digital objects. In
their simplest form, digital objects are containers that aggregate mime-typed
streams of data (e.g., digital images, XML files, metadata), known as
datastreams. It should be noted that datastreams can be references to
external data - either disseminations of other FEDORA digital objects,
or service requests to remote data sources. This capability allows FEDORA
digital objects to serve as aggregators and value-added surrogates for
existing on-line digital content.
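
As an illustration only, the container model described above might be sketched as follows in Python. The class names, field names, PID value and URL are assumptions made for this sketch and are not taken from the FEDORA specification.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Datastream:
    """One MIME-typed stream of data held by a digital object (hypothetical model)."""
    stream_id: str
    mime_type: str
    content: Optional[bytes] = None   # bytes held locally by the repository
    reference: Optional[str] = None   # or a pointer to external data

    def is_external(self) -> bool:
        # A datastream is external when it points at data instead of holding it.
        return self.reference is not None


@dataclass
class DigitalObject:
    """A container that aggregates datastreams under one identifier."""
    pid: str
    datastreams: Dict[str, Datastream] = field(default_factory=dict)

    def add(self, ds: Datastream) -> None:
        self.datastreams[ds.stream_id] = ds

    def list_datastreams(self) -> List[str]:
        return sorted(self.datastreams)


# A book object with one local XML datastream and one referenced page image.
book = DigitalObject(pid="example:book-1")
book.add(Datastream("TEXT", "text/xml", content=b"<book/>"))
book.add(Datastream("PAGE-001", "image/tiff",
                    reference="http://images.example.org/page-001"))
```

In this sketch the referenced page image is what lets the object act as a surrogate for content that lives elsewhere.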
In addition to behaving in a generic manner, digital objects must be
able to mirror real-world entities by providing access methods that
make an object behave in a content-specific manner. For example, a natural
behavior for a book would be "Get Table of Contents." FEDORA
allows the association of rich and extensible behaviors with digital
objects by "plugging in" generic components known as disseminators.
Each disseminator aggregates references to: (1) a formally defined behavior
interface that defines a set of methods for a particular kind of digital
library resource (e.g. a Book interface), (2) an executable mechanism
that runs these methods, and (3) the datastreams that the execution
mechanism should use to fulfill specific method requests. These interfaces
and mechanisms can, themselves, be stored as digital objects, laying
the foundation for unlimited extensibility of the architecture. A major
strength of the FEDORA extensibility model is that clients can use the
generic methods (of the default API) to discover and invoke content-specific
methods defined on disseminators. The digital object facilitates the
invocation of these extended methods, returning customized disseminations
of content to the client.
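
The three parts of a disseminator can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the class names, the method names (list_methods, disseminate) and the toy text format are hypothetical and do not reproduce the FEDORA interfaces.

```python
from typing import Any, Callable, Dict, List


class Disseminator:
    """Binds a behavior interface to a mechanism and the datastreams it reads."""

    def __init__(self, interface: List[str],
                 mechanism: Dict[str, Callable[[Dict[str, bytes]], Any]],
                 datastream_ids: List[str]):
        # The mechanism must supply every method that the interface promises.
        missing = [m for m in interface if m not in mechanism]
        if missing:
            raise ValueError(f"mechanism does not implement: {missing}")
        self.interface = interface
        self.mechanism = mechanism
        self.datastream_ids = datastream_ids


class DigitalObject:
    def __init__(self, pid: str, datastreams: Dict[str, bytes]):
        self.pid = pid
        self.datastreams = datastreams
        self.disseminators: Dict[str, Disseminator] = {}

    def list_methods(self) -> Dict[str, List[str]]:
        """Generic method: discover the content-specific methods on offer."""
        return {name: d.interface for name, d in self.disseminators.items()}

    def disseminate(self, disseminator: str, method: str) -> Any:
        """Generic method: invoke a content-specific method by name."""
        d = self.disseminators[disseminator]
        if method not in d.interface:
            raise KeyError(method)
        inputs = {i: self.datastreams[i] for i in d.datastream_ids}
        return d.mechanism[method](inputs)


def get_table_of_contents(streams: Dict[str, bytes]) -> List[str]:
    # Toy mechanism: chapter titles are the lines that start with "# ".
    lines = streams["TEXT"].decode("utf-8").splitlines()
    return [line[2:] for line in lines if line.startswith("# ")]


book = DigitalObject("example:book-1",
                     {"TEXT": b"# Chapter One\nsome text\n# Chapter Two\nmore text"})
book.disseminators["Book"] = Disseminator(
    interface=["get_table_of_contents"],
    mechanism={"get_table_of_contents": get_table_of_contents},
    datastream_ids=["TEXT"],
)
toc = book.disseminate("Book", "get_table_of_contents")
```

A client that knows only the two generic methods can call list_methods to find the Book behaviors and then request the table of contents without knowing how the text is stored.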
With the Cornell group's help, we installed their research reference
software version of FEDORA and began experimenting with some of our
digital collections. We quickly found that the reference implementation,
elegant piece of software that it is, was not what we needed for a large-scale
digital library. But we were convinced that FEDORA was exactly the conceptual
framework that we were looking for. So with the authors' help, we re-interpreted
the architecture and implemented it using an SQL database as the backend.
Since that time we have built a testbed that includes 500,000 data objects
including digital images and a wide variety of XML objects. We have
developed a variety of disseminators that provide a rich set of functionality
for electronic finding aids, TEI-encoded etexts of letters and books,
and for XML-encoded structured collections of art, architecture and
archeology images. We have also implemented three different object models
for images: one for multiple files for the various resolutions of a
single scan, one for single-file wavelet-encoded images and one for
page images that uses a single compressed TIFF file. In all three cases,
the user sees the images from one abstract point of view and is spared
the requirement of knowing their format.
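
The idea of one abstract view over several storage models might look like the following Python sketch. The class names, file names and selection rule are assumptions for illustration; for brevity the wavelet and page-image models share one class here, although the testbed treats them as separate object models.

```python
from abc import ABC, abstractmethod
from typing import Dict


class ImageObject(ABC):
    """The single abstract view a user sees, whatever the storage model."""

    @abstractmethod
    def source_for(self, max_width: int) -> str:
        """Name the stored file (and target width) that would serve the request."""


class MultiFileImage(ImageObject):
    """One file per resolution of a single scan."""

    def __init__(self, files_by_width: Dict[int, str]):
        self.files_by_width = files_by_width

    def source_for(self, max_width: int) -> str:
        fitting = [w for w in self.files_by_width if w <= max_width]
        # Largest stored resolution that fits, else the smallest one available.
        width = max(fitting) if fitting else min(self.files_by_width)
        return self.files_by_width[width]


class SingleFileImage(ImageObject):
    """One file that is decoded or scaled on request."""

    def __init__(self, filename: str):
        self.filename = filename

    def source_for(self, max_width: int) -> str:
        return f"{self.filename} scaled to {max_width}px"


images = [
    MultiFileImage({150: "scan-thumb.jpg", 600: "scan-screen.jpg", 3000: "scan-full.tif"}),
    SingleFileImage("scan.wavelet"),
    SingleFileImage("page-001.tif"),
]
# The caller makes the same request of every object and never inspects the format.
sources = [image.source_for(800) for image in images]
```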
Most recently, we have begun to do some stress testing of our implementation
using software that simulates simultaneous users requesting a realistic
mixture of different requests. We have been quite pleased with the results.
On a Sun Ultra80 two-processor workstation, simulating 20 simultaneous
users making requests with an average delay of 300 milliseconds, response
times average approximately half a second per request. Note that
for most of the XML object transactions this includes a server-side
rendering of the XML into HTML, a relatively processor-intensive action.
We are in the process of moving the repository to a four-processor,
dedicated server, where we will continue our testing. We plan to start
scaling our testbed up by duplicating the existing objects repeatedly,
running the user tests at 1,000,000 and 10,000,000 objects. We believe
that a repository that provides fast access to 10,000,000 objects is
a very good starting point for a practical digital library.
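
For a rough sense of the load that the stress test represents, the figures above can be turned into an approximate request rate. The calculation below assumes a closed-loop test in which each simulated user waits for a response, pauses for the 300-millisecond delay, and then issues the next request; the text does not state this explicitly, so the result is an estimate only.

```python
def closed_loop_throughput(users: int, delay_s: float, response_time_s: float) -> float:
    """Requests per second when each user cycles through delay + response time."""
    return users / (delay_s + response_time_s)


# 20 simulated users, 0.3 s delay between requests, about 0.5 s per response.
rate = closed_loop_throughput(users=20, delay_s=0.300, response_time_s=0.500)
print(f"{rate:.1f} requests/second")
```

Under that assumption the test corresponds to roughly 25 requests per second on the two-processor workstation.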
Project Description
We believe that it is time both to start developing a practical implementation
of the work that we have been prototyping and to explore and prototype
some of the more complex issues that a more complete implementation raises.
We propose to do that with input from other members of our community
(see Phase 1), so that we develop a good general
solution as quickly as possible. We also would like to get the repository
software that we produce into the hands of other people who are ready
to use and evaluate it.
We are convinced that we are on the right track with our implementation
of FEDORA. It gives us the basic approach that we need to manage all
of the digital resources that we are accumulating, while delivering
a very high level of service to our users. And we believe that the extensibility
of the architecture will allow us to adapt to the rapid technological
changes and new content forms that are inevitable.
We request funding for this project from the Andrew W. Mellon Foundation
based on our belief that the project is closely aligned with the mission
of the Foundation of promoting the broad dissemination of scholarly
content. In the digital age, such broad dissemination is dependent on
core technical developments, the roots of which lie in the research
community. The original research and development related to FEDORA was
undertaken under the auspices of DARPA- and NSF-funded research at Cornell
University. Per the general understanding in such research projects,
the funding was available for initial concept development, prototype
demonstration, and reportage in the form of conference and journal papers.
On the other hand, such government-sponsored research funding is not
available for the subsequent stages necessary for moving from research
to deployable implementation and other aspects of technology transfer
including packaging and support. The Foundation's possible funding of
FEDORA work by Virginia and Cornell would consequently leverage several
years of successful government-funded basic research and facilitate
the availability of the fruits of that research to the broadest community.
Such funding would also benefit from the fact that the NSF funding continues
at Cornell and would dovetail into the project as it matures. We are
confident that such pairing of funding mechanisms is the best possible
model for fostering state-of-the-art advances in digital libraries and
scholarly communication.
This project also would build from and directly support Mellon-funded
projects already underway at Virginia. The Supporting Digital Scholarship
(SDS) project, which concentrates on collecting the digital scholarly
projects that are being created by humanities scholars in the Institute
for Advanced Technology in the Humanities at Virginia, is built around
our prototype FEDORA implementation. That project has already informed
the design that underlies the project described in this proposal and
will continue to do so. The version of the repository that results from
phase 1 of this project (described below) should become available right
at the time that the SDS project delivers policy and technical guidelines
for collecting digital projects. This should allow us to implement those
policies immediately and begin formally collecting scholarly projects
into the digital library.
In the same manner, the basic working repository created with this grant
will deliver a full suite of management utilities to other Mellon-funded
projects underway or envisioned at Virginia. Our work will dovetail
with the digital imprint project approved for the University Press of
Virginia, and will be immensely useful for the American Studies Information
Community project (itself bringing in the Mellon-funded Early American
Fiction collection) that is one of the Open Archives Initiative projects
recently funded by the Foundation. The startup phases of these two projects
coincide with the detailed design and implementation phases of the repository
project, providing an opportunity for influencing the details of the
initial product by providing different content collection and delivery
issues to resolve. Then both projects would be able to move directly
to concentrating on using the repository to meet the specific needs
of publishers and American Studies scholars, respectively.
We believe this project is best undertaken in collaboration with our
colleagues at Cornell. We find the missions of our two groups to be
synergistic, spanning a continuum from basic research, through prototyping,
to eventual deployment of reference implementations. The Cornell group
works mainly in the basic research and prototype mode. FEDORA was originally
developed within this research framework, and NSF DLI2 funding currently
supports the basic work on policy enforcement and context sensitive
behaviors, which we will leverage as described later in this proposal.
The Virginia group sees itself functioning as a bridge between the computer
scientists doing digital library research and the libraries that are
building large digital collections. We believe that the collaborative
activities of this project will effectively demonstrate how digital
library research can be more immediately deployed in the libraries that
it is intended to serve.
We propose to form a research and development team composed
of people from Virginia and Cornell, with 1.0 FTE added at Cornell to
Lagoze's group, and 2.5 FTEs added at Virginia. The principal investigators
will be Thornton Staples, the director of the Digital Library Research
and Development group in the Library at Virginia, and Carl Lagoze, co-director
of the Cornell Digital Library Research Group. Also, people from the
Institute for Advanced Technology in the Humanities and from the Advanced
Technology Group in the Information Technology and Communications Department,
both at Virginia, will continue their work with FEDORA as members of
this team. The team will pursue a three-phase project, as detailed below,
with the goal of producing an open-source reference implementation,
which will be available to other libraries and practitioners as they
construct digital library systems. The first phase involves taking a
strong proof of concept (already done) and producing a package that
can be distributed and used in a variety of settings. The later phases
propose extending the results of ongoing research in order to fill out
the system with important functions that a sophisticated digital library
system needs.
Phase 1
This phase will involve finalizing the specifications of the basic
FEDORA system, implementing that system, and testing it in a variety
of deployment scenarios. The resulting product will be an efficient,
scalable reference implementation that can be the basis for many different
development efforts, one that libraries with a reasonably sophisticated
technical staff can use to begin to build their digital library systems.
It will include a set of generic modular tools that provide a full set
of basic repository management functions. We assume that this phase
will take one year, most likely measured from the time that the programming
effort begins.
We will continue to build our testbed at Virginia and we anticipate
having at least 1 million digital objects of a variety of types ready
to test the system that we will deploy in phase 1. An essential part
of this phase will be the participation of a select deployment group
(distinct from the development group) that will deploy testbeds of their
own materials at the same time. In each case, the participants listed
below either heard our presentations at the Association for Computing
in the Humanities and Digital Library Federation conferences or read
the article in D-Lib Magazine where we described our work and came to
us to find out more.
Two of the participating institutions are also digital library groups
which will be evaluating the system from that point of view. The rest
of the participants are project-oriented humanities groups who will
be testing the repository system as a basis for supporting projects
rather than for building a general digital library. As a group we will
be evaluating the system specifications and planning the system evaluation
as the programming is being carried out. At the end of this phase we
expect to have at least six implementations of working digital object
repositories to evaluate. We also expect that many if not all of these
repositories will continue to give us a rich testbed for later phases
of the project.
The success of phase one will be determined by the success of the deployment
group (consortium participants) in deploying separate testbeds in each
of their institutions. Feedback from the consortium members and other
users after the public release of the software will be used to evaluate
version 1.0 of the software and will guide future enhancements to the
repository software.
We would like to keep the number of participants to ten or fewer, to
make the process more manageable. Currently, the participants include:
- Jon Dunn from the Digital Library Group at Indiana University
- Lorna Hughes from the Humanities Computing group at New York University
- David Kahle and Greg Colati from the Digital Collections and Archives
Department at Tufts University
- Harold Short from the Humanities Computing group at King's College
London
- Marilyn Deegan from the Refugee Studies Center at Oxford University
Following initial drafting of the specifications by the development
group and dissemination to the deployment group in the summer of 2001,
the work in this phase will include:
1. Participating institutions will send a technical representative to
   a two-day meeting in Charlottesville in the fall of 2001, where we will:
   - demonstrate and discuss the Virginia implementation to make sure
     that all of the participants fully understand the basic concepts,
     see how the repository is being used by us, and understand the possibilities
     for other installations;
   - agree on a final version of the architectural specifications that
     will be version 1 of the repository system;
   - discuss the features defined for later phases, in order to sketch
     out the next steps so that we can attempt to account for them in
     the basic object architecture;
   - agree upon the specifications for the testbed that each participant
     will develop in step 3.
2. The development team will create the system software. The functionality
   of the first version will include:
   - implementation of a complete basic repository architecture that
     is based on the original FEDORA concepts;
   - a management console that a repository manager can use to perform
     all of the basic management functions;
   - a metadata searching and indexing service that is compliant with
     the Open Archives Initiative.
3. Participants will each deploy version 1.0 of the system and build
   a testbed of their own digital objects, as agreed upon in step 1.
4. Appropriate fixes and small changes will be incorporated based on
   the testbed experience and version 1.0 will be made available from a
   publicly accessible open-source site.
Phase 2
The second phase of this project will concentrate on adding the functionality
needed for a repository that supports large-scale digital content creation,
storage and delivery efforts. This will involve enhancing and extending
the management utilities developed in phase 1, in addition to concentrating
on the development areas listed below. We expect that some of the participants
from phase 1 will be interested in the problems associated with large
scale production and will be interested in developing a new testbed
definition as this phase develops and deploying the new version when
it is complete. We will solicit new participants who are well situated
to evaluate the work as it progresses. We will also be interested in
continuing to work with groups that are interested in deploying smaller
repositories to evaluate how these additional functionalities can be
used effectively in those settings.
Security and Policy Enforcement - We assume that each digital
object in the repository should be able to have a variety of policies
associated with it. First among these policies must be those associated
with access control. But many other policies are possible, for example
preservation policies that describe the events and actions necessary
to maintain objects over time.
In the area of access control, we recognize the need to specify policies
that are both general-purpose and object-specific. Some policies may
be defined at the repository level and may address high-level operations
such as who can create or delete objects. Other policies may be tailored
to the nature of individual objects in the repository. Initially, we
will focus investigation in two areas:
- Flexible Policy Specification: Cornell's NSF-funded work is investigating
new policy definition languages that are both easy-to-use and expressive.
We plan to exercise that research in this project so that we may specify
access control policies that are customized to fit the nature of different
kinds of objects and usage scenarios. We plan for policies to be expressed
in both human and machine-readable formats.
- Extensible Policy Enforcement Mechanisms: To be effective, policies
must not only be expressed, they must be enforced. Thus, we will also
examine several mechanisms for enforcing machine-readable policies.
We also recognize that digital objects change and evolve over time
and that our enforcement mechanism must be extensible. We will investigate
mechanisms that are easily adaptable to changes in objects and their
usage context.
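
The distinction between repository-level and object-specific access policies can be sketched as follows. This is a minimal Python illustration, not the policy language under investigation at Cornell: the Policy fields, the role names and the rule that an object-specific policy overrides a repository-wide one are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class Policy:
    """A machine-readable access rule (hypothetical structure)."""
    operation: str                 # e.g. "create", "delete", "read"
    allowed_roles: FrozenSet[str]
    pid: Optional[str] = None      # None means the policy is repository-wide


def is_permitted(policies: Iterable[Policy], role: str, operation: str,
                 pid: Optional[str] = None) -> bool:
    """Enforce the policies: object-specific rules win, and the default is deny."""
    policies = list(policies)
    specific = [p for p in policies
                if pid is not None and p.pid == pid and p.operation == operation]
    general = [p for p in policies if p.pid is None and p.operation == operation]
    applicable = specific or general
    return any(role in p.allowed_roles for p in applicable)


policies = [
    Policy("delete", frozenset({"manager"})),                        # repository level
    Policy("read", frozenset({"public", "manager"})),                # repository level
    Policy("read", frozenset({"manager"}), pid="example:restricted-1"),  # one object
]

public_reads_open = is_permitted(policies, "public", "read", "example:open-1")
public_reads_restricted = is_permitted(policies, "public", "read", "example:restricted-1")
public_deletes = is_permitted(policies, "public", "delete")
```

Keeping the rules as data, separate from the function that enforces them, is one way to let the enforcement mechanism stay fixed while policies change with the objects.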
Collection Objects - We believe that item-level granularity
is not appropriate for all the functionality that we want to build into
our system. Indeed, there are a variety of repository functions that
should be implemented at an aggregated level, within what we call collection
objects. These objects would represent a group of related digital objects
and provide a place to describe and document a collection as a whole,
as well as to attach computer programs to be used for manipulating and
analyzing it. The relationship of collection objects with related items
will be either rule-based (for which criteria associated with the collection
object are used to locate the objects which are members of the collection)
or explicit (in which objects that are members of the collection are
enumerated). Collection objects might be used to generalize a specific
function across a class of digital objects; for example, a collection
object might be used to implement a function such as metadata searching
and indexing across its set of constituent objects by accessing a specific
datastream in those objects. We will also develop collection objects
that can act as templates for large classes of objects, providing a
way to streamline the process of updating large classes of objects.
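
The two kinds of collection membership, and a function generalized across a collection's members, might be sketched like this in Python. The class names, metadata fields and sample records are hypothetical and serve only to illustrate the idea.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set


@dataclass
class Item:
    pid: str
    metadata: Dict[str, str]   # stands in for a metadata datastream


@dataclass
class CollectionObject:
    pid: str
    description: str
    explicit_members: Set[str] = field(default_factory=set)
    rule: Optional[Callable[[Item], bool]] = None

    def members(self, repository: List[Item]) -> List[Item]:
        # Rule-based membership: criteria locate the members.
        if self.rule is not None:
            return [item for item in repository if self.rule(item)]
        # Explicit membership: the members are enumerated by PID.
        return [item for item in repository if item.pid in self.explicit_members]

    def search(self, repository: List[Item], field_name: str, term: str) -> List[str]:
        """Search one metadata field across the collection's constituent objects."""
        term = term.lower()
        return [item.pid for item in self.members(repository)
                if term in item.metadata.get(field_name, "").lower()]


repository = [
    Item("example:1", {"type": "letter", "title": "Letter to a friend"}),
    Item("example:2", {"type": "book", "title": "A History of Virginia"}),
    Item("example:3", {"type": "letter", "title": "Letter from Virginia"}),
]

letters = CollectionObject("example:letters", "All letters",
                           rule=lambda item: item.metadata.get("type") == "letter")
picked = CollectionObject("example:picked", "Hand-picked items",
                          explicit_members={"example:1", "example:2"})

letter_pids = [item.pid for item in letters.members(repository)]
virginia_letters = letters.search(repository, "title", "virginia")
virginia_picked = picked.search(repository, "title", "virginia")
```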
Storage Management - We will develop a storage management system
that would allow the repository to control access to one or more file
systems that house local datastreams. The processes that create or update
a local datastream would, in addition to updating the repository, be
responsible for accepting the contents of the datastream from the user
and passing it to the file server.
The goal of phase two is to use the results of evaluations of version
1.0 of the software (conducted in phase one) to add new functionality
to make the software usable in large-scale production environments.
The features outlined here are a first impression of what those additional
features may need to include. We expect many of the repositories deployed
by consortium participants in phase one will provide valuable testbeds
to conduct additional testing of the new enhancements. We also expect
these testbeds to provide valuable feedback for both evaluating the
new features and for suggesting additional enhancements for the future.
Phase 3
The third phase of this project will concentrate on extending the facilities
in the repository that provide more sophisticated delivery of end-user
experiences in a large scale digital library. This will include extending
the functionality of disseminators, adding services that are important
for collecting scholarly projects and publications, as well as overall
optimization of the system. As with phase 2, we hope that the deployers
from earlier phases of the project will be interested in continuing
with this phase. We will also solicit new participants who are well
situated to evaluate the work as it progresses.
Editions and Versioning of an Object - The repository must make
it possible to retain and provide on demand every version of an object
if desired. We propose to offer a standard way to make a new edition
of an object available as a separate object in the repository, as well
as to make it possible to track every change to an object within the
object itself. A new edition of an object will be a completely new object.
It will have a new PID, its own metadata, etc. There will be a field
in the system metadata that contains the PID of the object from which
it was derived.
Versions of the components of an object will be kept in the object.
The create date for each version of each component of the object will
indicate the date and time that the version became current. The version
of the whole object on a given date and time could be disseminated by
giving the date and time as an extension of the PID. The version of
each component in the object with a create date and time most nearly
previous to the given date and time would be used in the dissemination.
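
The date-based selection of component versions described above can be sketched in Python as follows. The class and function names are assumptions for illustration. The sketch also treats a version created exactly at the requested moment as current, which is one reading of "most nearly previous", and it does not model how the date and time would be attached to the PID.

```python
from bisect import bisect_right
from datetime import datetime
from typing import Dict, List, Optional


class VersionedComponent:
    """Keeps every version of one component, ordered by create date."""

    def __init__(self) -> None:
        self._dates: List[datetime] = []
        self._contents: List[str] = []

    def add_version(self, created: datetime, content: str) -> None:
        # Insert so that the two lists stay sorted by create date.
        idx = bisect_right(self._dates, created)
        self._dates.insert(idx, created)
        self._contents.insert(idx, content)

    def as_of(self, when: datetime) -> Optional[str]:
        """Return the version whose create date is the latest one not after `when`."""
        idx = bisect_right(self._dates, when)
        return self._contents[idx - 1] if idx else None


def disseminate_as_of(components: Dict[str, VersionedComponent],
                      when: datetime) -> Dict[str, str]:
    """Assemble the whole object as it stood at `when`, skipping components not yet created."""
    snapshot = {}
    for name, component in components.items():
        content = component.as_of(when)
        if content is not None:
            snapshot[name] = content
    return snapshot


text = VersionedComponent()
text.add_version(datetime(2001, 1, 1), "text v1")
text.add_version(datetime(2001, 6, 1), "text v2")
metadata = VersionedComponent()
metadata.add_version(datetime(2001, 3, 1), "metadata v1")
components = {"text": text, "metadata": metadata}

february = disseminate_as_of(components, datetime(2001, 2, 1))
april = disseminate_as_of(components, datetime(2001, 4, 1))
july = disseminate_as_of(components, datetime(2001, 7, 1))
```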
Dynamic, Context Sensitive Behaviors - We envision scenarios
where the pre-defined disseminations on an object will not be appropriate
to a given usage context. In certain cases, our collaborators may wish
to re-use each other's objects in new ways. One option is for repository
managers to create new disseminators on objects to meet such needs as
they arise. Another interesting approach is to provide a mechanism for
exposing a special kind of structural metadata about an object that
enables a third party to: (1) learn about the nature of the object's raw
content, and (2) access relevant parts of that content in a format that
facilitates re-use. In a way, we can think of this as enabling "just-in-time"
disseminators for an object.
We envision implementing this scheme by introducing a new service into
the repository architecture: a context broker service. We anticipate
a time when all of our collaborators are running repositories using
the repository software we develop. Each site can also run a context
broker service whose purpose is to contextualize the experience of objects
in other collaborators' repositories.
Efficiency and Scale Optimization - Though we will be attempting to
optimize each module of the system as we develop it, we believe that
we need to devote part of the last phase of the project to optimizing
the integrated system. We need to ascertain that the repository can
support hundreds of millions of objects with 50 simultaneous users, in
a realistic combination of user requests and repository management processes.
If the proposed scale proves to be impossible, we will investigate other
strategies, such as coordinated, multi-repository installations.
The goal of phase three is to continue to evaluate and enhance the software,
building upon input received from consortium participants and others
in the open source community who are actively using the repository software.
We expect the version of software that emerges at the end of phase three
to be capable of supporting large-scale digital content and delivery
efforts. We also expect the software to be capable of providing the
necessary services that are important for collecting scholarly projects
and publications and provide tools for end-users to discover and manipulate
content in the repository. We anticipate the success of phase three
and the project as a whole will be judged by the experiences of consortium
members and others in implementing the software to manage large scale
digital collections. We also envision that the various implementations
of the software will offer rich testbeds for future projects.
Implementation Plan
The project will extend over a three-year period, with approximately
a year anticipated to complete each of the three phases outlined above. Evaluative
input from the consortium participants about software features and performance
may necessitate changes to the planned Phase Two and Three activities.
Obviously, delays in hiring or unplanned technical issues may require
adjustments, but an approximate timeline of events includes the following:
- Year 1 (Phase 1)
  - Conduct first meeting of participants and orient consortium members
    to project activities timeline
  - Finalize version 1 design and programmer implementation specifications
  - Implement alpha version of software, leveraging existing code
    from prototype
  - Orient consortium participants on deployment of alpha testbeds
  - Deploy alpha testbeds by each participating consortium member
  - Obtain initial evaluation results from alpha testbeds
  - Address concerns/problems from initial evaluations of alpha testbeds
  - Issue open source public release of version 1.0 of repository
    software
  - Issue annual report summarizing progress
- Year 2 (Phase 2)
  - Evaluate version 1.0 of software
  - Conduct meeting of consortium members to assess feature priorities
    and evaluate Year 1 activities
  - Address security and policy enforcement
  - Address collection objects and collection object management
  - Address storage management system
  - Release version 1.0.x of repository software with enhancements
  - Issue annual report summarizing progress
- Year 3 (Phase 3)
  - Evaluate version 1.0.x of software
  - Conduct meeting of consortium members to assess feature priorities
    and evaluate Year 2 activities
  - Address editions and versioning of objects
  - Address dynamic context sensitive behaviors
  - Optimize for efficiency and scale
  - Release version 1.1.x of repository software with enhancements
  - Conduct meeting of consortium members to evaluate project
  - Issue final report summarizing project
Budget
The costs associated with this project will predominantly be the costs
of personnel. We are requesting $1,000,128 from the Mellon Foundation
to provide 3.5 full time equivalent staff, including a Technical Coordinator
and staff to work on the design and programming of the system proposed,
plus funding for equipment for those people and funding to provide travel
expenses for 4 meetings of the development team per year for each of
three years.
The Technical Coordinator position will be in the Digital Library Research
and Development (DLRD) group and will report to Thornton Staples. We expect
this person to participate in design and implementation discussions
with the research and development team and to be the primary point of
contact for the deployment group as they begin to deploy the software.
He or she will test the software as it evolves using the Virginia testbed.
This person will not necessarily be a high-level programmer but will
need to be very technically sophisticated, as well as a good communicator
and organizer. We see this position as key to organizing the project
and keeping all of the participants in sync. This person will coordinate
all of the activities associated with the project, including organizing
meetings, conference calls and other communications, as well as overseeing
any administrative needs.
The 2.5 FTEs of programming time will be divided between Virginia and
Cornell. These will be high-level programming positions that we believe
are necessary to design and develop the software required for the proposed
system.
- 1.5 FTE - programmers/Virginia: The 1.5 FTE at Virginia will be
divided between the DLRD (1.0 FTE to be supervised by Ross Wayland)
and the Advanced Technology Group (ATG) (.5 FTE to be supervised by
Tim Sigmon, director of the group). We believe that by placing these
positions in these three software development groups we strengthen
the connections among them to continue the collaboration that has
produced the prototype. Note that the 25% of Staples' time and 50% of
Wayland's time already committed to developing the prototype will
continue to be devoted to this project. Also, the ATG will match the
.5 FTE from the grant, plus Tim Sigmon will continue to dedicate 10%
of his time to the project.
- 1.0 FTE - programmers/Cornell: Carl Lagoze will directly supervise
the 1.0 FTE to be added to the Cornell Digital Library Research Group
(CDLRG). This will include moving 50% of Sandy Payette's time from
another grant project plus adding another .5 FTE. Funding will be
provided to Lagoze through a sub-contracting arrangement (see attached
letter). Lagoze will contribute 5% of his time to the project.
The participants in the deployment group have committed to covering
the expenses of their own participation. We will use the Technical
Coordinator position to make that commitment as easy as possible to
meet. In addition, we have a verbal commitment from Daniel Greenstein,
the director of the Digital Library Federation, to cover the expenses
of the fall meeting.
Appendix A:
- Architectural Specifications
- General Object Model
The core of the specification is the model used to define
an object. Understanding the meaning of each component of this model
and how the components interact is critical to the successful design
and implementation of the repository system. To establish a common
vocabulary, we have provided definitions for key components of the
object model and the repository.
- Object - From an architectural perspective, a digital
object consists of a number of components that include a Unique
Identifier (UID), an Object Map, System Metadata, one or more
Disseminators, and one or more Datastreams.
- Unique Identifier (UID) - A UID is the unique persistent
identifier for the object, maintained by the repository software.
Note that the UID is defined for internal use in the repository
and may or may not be exposed to the outside world.
- Object Map - An Object Map describes the internal
structure of an object. It identifies each component in the
object, and defines relationships among components. An important
function of the object map is defining the roles that datastreams
play in the context of specific disseminators (i.e., which
datastreams are used by which disseminators). Each object
must contain a single Object Map.
- System Metadata - System metadata consists of a stream
of bytes containing ASCII text marked up with XML tags that
conform to a specific XML Schema/DTD. It records the minimal
amount of information about the object and its components
that is necessary for basic internal repository management
and indexing.
- Datastream - A Datastream is a component that consists
of a typed stream of bytes that adds content to the object
(e.g., a digital image, an etext, a program, metadata, a database,
a mapping or relational structure, etc.). An object can have
one or more datastreams. Each datastream must contain the
following:
- Name - identifier for the datastream
- Description - textual description of the datastream
- Content Type - MIME type of the datastream
- Control Type - There are two possible control
types:
- Internal - a datastream for content under the
direct control of the repository owner.
- External - a datastream for content that is outside
the direct control of the repository.
- Contents - a pointer to a MIME-typed stream of
bytes (e.g., a local file system address or an HTTP address).
- Disseminator - A Disseminator is a component that
associates a set of behaviors with an object so that clients
can obtain different views of the object's content (datastreams).
A disseminator may transform content, process content, or
prepare custom presentations of content. Each disseminator
contains a mapping to a Signature and a Servlet, both described
below. A Signature is a specification for a set of behaviors
that an object is able to perform. In object-oriented terms,
a signature is an abstract interface definition that can be
implemented by a program. A Servlet is a module that implements
the behaviors (methods) described by a signature, in a specific
setting. A servlet consists of one or more methods, each
of which implements a behavior as an "action" upon
the object's content. Within any given signature and servlet
pairing, there is a one-to-one correspondence between each
method in the servlet and a method in the associated signature.
Incidentally, a one-to-many relationship exists between signatures
and servlets since the same signature can be implemented in
different ways by multiple servlets. In object-oriented terms,
a servlet implements the interface defined by its corresponding
signature.
- Signature - A signature contains one or more
methods, each consisting of:
- MethodName - identifier for the method
- SignatureMethodDescription - an end-user
oriented description of the method/behavior
- MethodReturn Type - MIME type returned by
the method
- MethodParameter(s) -
- ParameterName - identifier for the parameter
- ParameterDescription - a description of the parameter
and how it is used
- ParameterDefault Value - default value for the
parameter (optional)
- Servlet - A servlet contains one or more methods,
each consisting of:
- MethodName - identifier for the method (identical
to method name in the corresponding signature)
- ServletMethodDescription - a description
of how this specific behavior is implemented; this
is different from the end-user oriented description
of the behavior defined in the signature
- MethodReturn Type - MIME type returned by
the method (identical to the method return type in
the corresponding signature)
- MethodParameter(s) - (identical to the parameters
in the corresponding signature)
- ParameterName - identifier for the parameter
- ParameterDescription - a description of the
parameter and how it is used
- ParameterDefault Value - default value for the
parameter (optional)
- Action - a machine interpreted instruction
on how to invoke a piece of executable code that performs
the specific behavior
- Dissemination - A dissemination is the result of executing
a specific behavior defined by a signature that returns a MIME-typed
stream of bytes. In order to perform a dissemination, a repository
must receive a request that contains four pieces of information:
- Name - an identifier for the object
- Signature name - the identifier of a signature to which
the object subscribes. Again, a signature describes a set
of possible behaviors for the object.
- Method name - the name of a behavior of the given signature
- Parameter name/value pairs - zero or more parameter name/value
pairs that may be required by the given method name
The ability to perform disseminations is a fundamental requirement
of the repository architecture. When a client issues a dissemination
request, the repository software must be able to (1) locate
the corresponding servlet for the request, (2) execute the appropriate
servlet action, and (3) deliver the result to the requestor
of the dissemination.
From an access perspective, a client does not need to know the
architectural details of an object. The only thing an accessing
client must do is issue a valid dissemination request (with
the above four arguments).
- Open Protocols
- Open Repository Dissemination Protocol (ORDP)
The repository will define an open protocol for obtaining disseminations
from objects. This will promote interoperable access to digital object
content among distributed repositories. The protocol will be implemented
over HTTP.
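Since the protocol itself is not yet specified, the following is only one possible way the four parts of a dissemination request might be carried over HTTP. The URL layout (object name, signature name, and method name as path segments, parameters as a query string), the function names, and the `repo.example.org` host are assumptions for illustration.

```python
from urllib.parse import parse_qsl, quote, unquote, urlencode, urlsplit


def build_request_url(base, name, signature, method, params=None):
    """Encode a dissemination request as a URL (hypothetical layout)."""
    path = "/".join(quote(part, safe="") for part in (name, signature, method))
    query = urlencode(params or {})
    return f"{base}/{path}" + (f"?{query}" if query else "")


def parse_request_url(url):
    """Recover the four request parts from a URL built as above."""
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")[-3:]
    name, signature, method = (unquote(s) for s in segments)
    return name, signature, method, dict(parse_qsl(parts.query))


url = build_request_url(
    "http://repo.example.org/ordp", "demo:1", "Image", "getThumbnail",
    {"size": "128"},
)
request = parse_request_url(url)
```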
- Open Archives Initiative (OAI) Protocol
The repository should be compliant with the OAI protocol so that metadata
can be harvested by other services. To this end, we will build an
OAI component in the repository. This component will respond to all
valid OAI requests and will harvest metadata from all objects that
implement Metadata Disseminators. The OAI-compliant repository can
expose metadata to both companion service components (see Indexing
and Search Service below) and to third-party external services.
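As a rough illustration of what a harvesting service would send to the OAI component, the sketch below builds a request URL. The `verb` and `metadataPrefix` arguments and the `ListRecords` and `oai_dc` values are taken from the published OAI harvesting protocol; the base URL and the helper function name are assumptions.

```python
from urllib.parse import urlencode


def oai_request_url(base_url, verb, **arguments):
    """Build an OAI harvesting request; the base URL is hypothetical."""
    query = {"verb": verb, **arguments}
    return f"{base_url}?{urlencode(query)}"


harvest_url = oai_request_url(
    "http://repo.example.org/oai", "ListRecords", metadataPrefix="oai_dc")
```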
- Mechanisms for Mediating Communication
The repository must have mechanisms that enable access to external
services. From the perspective of accessing an object, there are two
distinct scenarios where communication must be mediated:
- Accessing datastreams - datastreams are stored as pointers within
the repository and require a supporting mechanism to enable the
object to retrieve the contents of that remote referenced datastream.
- Executing disseminations - the dissemination of a specific behavior
of an object requires executing the action specified in the appropriate
servlet module. The execution of the servlet module will require
a supporting mechanism that may be different from that used for
accessing a datastream.
Currently, the UVA implementation supports an HTTP-mediating mechanism
for accessing datastreams and executing disseminations. It is highly
desirable to consider offering mechanisms that mediate other protocols
(e.g., Z39.50) so the repository can utilize remote services not
directly accessible through HTTP.
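One way to leave room for additional protocols is to select the mediating mechanism from the scheme of the datastream pointer. The sketch below assumes this design; the `MediatorRegistry` class is hypothetical, and the two mediators are stubs that return placeholder bytes rather than performing real HTTP or Z39.50 retrieval.

```python
from urllib.parse import urlsplit


class MediatorRegistry:
    """Maps a pointer's URL scheme to the function that can resolve it."""

    def __init__(self):
        self._mediators = {}

    def register(self, scheme, fetch):
        self._mediators[scheme.lower()] = fetch

    def fetch(self, pointer):
        scheme = urlsplit(pointer).scheme.lower()
        try:
            mediator = self._mediators[scheme]
        except KeyError:
            raise LookupError(
                f"no mediator registered for scheme {scheme!r}") from None
        return mediator(pointer)


# Stub mediators for illustration only; real ones would open connections.
registry = MediatorRegistry()
registry.register("http", lambda pointer: b"fetched over HTTP")
registry.register("z39.50r", lambda pointer: b"fetched over Z39.50")

http_result = registry.fetch("http://images.example.org/demo1.jpg")
z3950_result = registry.fetch("z39.50r://catalog.example.org/books")
```

With this arrangement, supporting a new protocol means registering one more mediator; the object and its datastream pointers stay unchanged.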
- External Application Management Services
The execution of disseminations by the repository requires running
a specific chunk of code that is referenced through the disseminator's
associated servlet module. Running this piece of code may require
one or more application services for the host server to accomplish
this task. For example, if the executable chunk of code is a CGI script
written in Perl, then the host on which the servlet is implemented
must also provide a Perl interpreter with which to run the Perl script.
In this example, the external application service is the Perl interpreter.
Developers writing servlet modules will know what application services
will be required by their servlets. The repository must provide a
means of installing servlet modules into the repository environment.
This includes the storing of metadata about external application services
used by one or more servlet modules. To ensure the integrity of the
execution environment, the repository will periodically run tests
to detect any undesirable changes in the external application services.
System managers and developers will be able to query the repository
for information about which services are currently supported by the
repository, and about dependencies between servlet modules and external
services.
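The bookkeeping described above (recording which external services each servlet module needs, checking that they are still present, and answering dependency queries) might look roughly like the following. The `ServiceRegistry` class and its method names are hypothetical, and the sketch assumes a service is "available" when an executable of that name can be found on the host, using the standard `shutil.which` lookup by default.

```python
import shutil


class ServiceRegistry:
    """Tracks external application services required by servlet modules."""

    def __init__(self, locate=shutil.which):
        # locate(name) returns a path, or None if the service is absent.
        self._locate = locate
        self._requirements = {}   # servlet module name -> set of services

    def install(self, servlet_module, services):
        self._requirements[servlet_module] = set(services)

    def required_services(self):
        """All services some installed servlet module depends on."""
        return sorted(set().union(*self._requirements.values()))

    def dependents(self, service):
        """Servlet modules that depend on the given service."""
        return sorted(
            m for m, needed in self._requirements.items() if service in needed)

    def missing(self):
        """Periodic check: modules whose services can no longer be found."""
        report = {}
        for module, needed in self._requirements.items():
            absent = sorted(s for s in needed if self._locate(s) is None)
            if absent:
                report[module] = absent
        return report


# Fake lookup so the example does not depend on the local machine:
# only "perl" is treated as installed.
fake_locate = lambda name: "/usr/bin/perl" if name == "perl" else None

services = ServiceRegistry(locate=fake_locate)
services.install("thumbnail-servlet", ["perl"])
services.install("tei-servlet", ["perl", "xsltproc"])
problems = services.missing()
```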
- Indexing and Search Service
This companion service would provide a way to index and search XML-encoded
metadata for all objects in the repository that contain Metadata Disseminators.
The service will harvest metadata from the Repository using the OAI
protocol. Harvested metadata would be made available to both internal
and external XML indexing and search engines.
- Logging and Audit Trail
Because all actions of or on an object are disseminations of that
object, it seems natural to keep track of the use of objects by creating
log records for each dissemination. All of the parameters of the dissemination
call must be kept, including: the UID of the object, signature name,
method name, and parameter name/value pairs. In addition, as much
information about the character of the requestor as possible should
be kept as well, including: IP address, communications protocol, etc.
The logging mechanism should be configurable in a manner similar to
the logging of the Apache web server.
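A configurable log record of this kind could be driven by a format template, in the spirit of a web server's log format directive. The sketch below is an assumption about how that might look: the field names, the default template, and the `format_log_record` function are invented for illustration and are not taken from any existing logging configuration.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

# Hypothetical default template; an administrator could reorder or drop fields.
DEFAULT_FORMAT = "{ip} [{time}] {protocol} {uid} {signature} {method} {params}"


def format_log_record(fmt, *, ip, protocol, uid, signature, method,
                      params, when):
    """Render one dissemination request as a single log line."""
    return fmt.format(
        ip=ip,
        time=when.isoformat(),
        protocol=protocol,
        uid=uid,
        signature=signature,
        method=method,
        params=urlencode(params) or "-",   # "-" when there are no parameters
    )


line = format_log_record(
    DEFAULT_FORMAT,
    ip="192.0.2.7",
    protocol="HTTP",
    uid="demo:1",
    signature="Image",
    method="getThumbnail",
    params={"size": "128"},
    when=datetime(2001, 3, 1, 12, 0, 0, tzinfo=timezone.utc),
)
```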
- Repository Management Utilities
Given the Object Model, the function of the repository is to store,
retrieve, index (possibly multiple indices), and maintain all the
objects in the repository. To use the repository effectively, a suite
of repository management tools must also exist that enable system
administrators and library personnel to interact with objects stored
in the repository.
Phase 1 of the project would include the development of a "control
panel" for the repository that allows the repository manager
to carry out all of the functions listed below. In addition, each
of these utilities should be available as a module that can be used
in a variety of other processes.
- Create an Object
- Create a UID
- Create Object Map
- Create System Metadata datastream
- Modify an Object
- Modify UID
- Modify Object Map
- Add a datastream
- Modify a datastream
- Delete a datastream
- Add a disseminator
- Modify a disseminator
- Delete a disseminator
- Add access policy
- Modify access policy
- Delete access policy
- Delete an Object
- Remove an object "appropriately"
Questions:
- What does it mean to completely remove an object?
- Is the UID to be deleted referenced anywhere else in the
repository?
- Do you retain the UID after deleting an object?
- How to handle remote referenced datastreams in a deleted
object?
- What to do if this object is the last one referring to a
servlet and/or signature (i.e., should removing an object
also mean removing any associated signatures and servlets)?
- Access an Object
- List map of an Object
- For internal datastreams, provide a listing of the header
information only.
- For remote referenced datastreams, provide a listing of
the header information and the reference to the resource.
- List the methods of a given Signature or Servlet (and all their
components)
- List access policies
- List contents of a datastream
- Disseminate an object (i.e., execute a specific behavior)
- Disseminate a list of objects
- Search for Objects
- Search for objects based on repository component information
(e.g., UID, signature name, servlet name, datastream name)
- Search for objects based on metadata information (e.g., creator,
title, subject, etc.)
All repository management utilities should provide a batch processing
interface that enables batch loading, modifying, and removing of
complete objects or any of their components.
- Implementation Details
- In the first phase, we will develop a system that can handle at
least 1 million objects while sustaining average response times
of less than 2 seconds per transaction for 20 simultaneous users,
running on a 4-processor Sun Enterprise 420R server, and using a
freely available SQL database package. These conditions will be
simulated using JMeter, or an equivalent software user simulation
system. The transactions will be a realistic mix of a variety of
user requests and management processes.
- For the repository back-end database, we will provide default
bindings for one freeware SQL database package (e.g., MySQL) and
one commercial package (e.g., Oracle).
- For the repository indexing and search services, we will provide
default bindings for one freeware XML package (e.g., SGREP) and
one commercial package (e.g., XYZFind).
- All repository software development will be done using Java.
- The resulting software will all be made freely available under
a GPL open-source license.
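The Phase 1 performance target above (average response time under 2 seconds with 20 simultaneous users) can be expressed as a simple measurement. The sketch below is not a substitute for JMeter; it only illustrates the shape of the check, assuming each simulated user is a thread issuing a fixed number of requests. The `average_response_time` function and the stand-in transaction are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def timed(transaction):
    """Return the wall-clock duration of one transaction in seconds."""
    start = time.perf_counter()
    transaction()
    return time.perf_counter() - start


def average_response_time(transaction, users=20, requests_per_user=5):
    """Run the transaction from `users` concurrent threads and average."""
    def user_session(_):
        return [timed(transaction) for _ in range(requests_per_user)]

    with ThreadPoolExecutor(max_workers=users) as pool:
        sessions = list(pool.map(user_session, range(users)))
    return mean(t for session in sessions for t in session)


# Stand-in for a real repository request; replace with an actual call.
def fake_transaction():
    time.sleep(0.01)


average = average_response_time(fake_transaction, users=20, requests_per_user=2)
meets_target = average < 2.0
```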