How we acquire it: We buy the software and support from Endeavor.
We cannot legally redistribute the software, and cannot practically
modify it, except through the customization options the software provides.
Local use and support: Used by many librarians, especially those
in cataloging, collection development, and circulation.
Related technologies: Voyager maintains its data in an
Oracle database.
Cataloging information is in the MARC format.
WebVoyage provides a Web-based front end
to Voyager's cataloging search capabilities.
Local contact: Sandra Kerbel, Director, Public Services (skerbel@pobox.upenn.edu)
What it is: Software that allows our users to search our
catalog over the Web.
Software provided by: Endeavor Information Systems (see above).
How we acquire it: We buy the software and support from Endeavor.
We cannot legally redistribute the software, and cannot practically
modify it, except through the customization options the software provides.
Local use and support: This is the primary software our
patrons use to access our catalog. (It's also possible to search our
catalog via Z39.50 clients. Some local applications also
have direct access to the database via
SQL.)
What it is: The standard format for library catalog entries. Can
also be used to store structured metadata similar to bibliographic data.
Software using this format: Voyager (and WebVoyage, the software
that runs Franklin)
Format specified by: The Library of Congress (see their MARC Standards Page).
Local use and support: Used most extensively in the Information
Processing Center (for cataloging). We are also considering using
MARC records (with some new field definitions) for metadata for such
items as digital images and electronic journals. Most librarians have
some degree of knowledge and experience of MARC.
Local documentation: See our local Voyager documentation for some information
on how we use MARC locally.
Local contact: Carton Rogers, Director, IPC (rogers@pobox.upenn.edu)
What it is: An SGML-based
format for describing archival finding aids.
Format specified by: The Library of Congress and the
Society of American Archivists (see their official EAD web site).
Local use and support: The Rare Books and Manuscripts division
has prepared some finding aids using EAD, and has provided them to
RLG, but currently we don't have software for processing EAD directly.
(We have some translations of some EAD descriptions into HTML.)
We hope to acquire SGML/XML software soon that will allow us to
make better use of our EAD descriptions.
Local contact: Delphine Khanna, Digital Projects Librarian (delphine@pobox.upenn.edu)
We recommend that large bodies of information that follow a highly
structured pattern, or that should be presented multiple ways,
be maintained in a database,
instead of being maintained
as individual HTML documents.
We have tools, such as Cold Fusion,
that can automatically turn database information into HTML documents.
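As a rough illustration of the idea (not of Cold Fusion itself), the following Python sketch keeps a few records in an in-memory SQLite database and renders them as an HTML table. The table name `databases`, its columns, and the sample rows are all invented for the example.

```python
import html
import sqlite3


def make_sample_db():
    """Create an in-memory database holding two invented sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE databases (title TEXT, subject TEXT)")
    conn.executemany(
        "INSERT INTO databases VALUES (?, ?)",
        [
            ("Sample Art & Design Index", "Fine arts"),
            ("Sample Chemistry Index", "Chemistry"),
        ],
    )
    return conn


def rows_to_html(conn):
    """Render every row of the hypothetical 'databases' table as an HTML table."""
    cursor = conn.execute("SELECT title, subject FROM databases ORDER BY title")
    lines = ["<table>", "<tr><th>Title</th><th>Subject</th></tr>"]
    for title, subject in cursor:
        # Escape the values so characters such as & and < display correctly.
        lines.append(
            f"<tr><td>{html.escape(title)}</td><td>{html.escape(subject)}</td></tr>"
        )
    lines.append("</table>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(rows_to_html(make_sample_db()))
```

Because the page is generated from the data, the same rows could also be presented in other ways (sorted by subject, say) without editing any HTML by hand.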
Local contact: Mike Winkler, Web Manager (winkler4@pobox.upenn.edu)
What it is: A format for publishing documents, designed
primarily for encoding their appearance on screen and on paper.
Software using this format: Adobe's Acrobat suite, various
third party tools (see PDFZone
for a list of PDF-aware tools)
Format specified by: Adobe. The specification is public, though
rather complicated. It
can be found from this page.
Local use and support: Used for our Oxford University Press
history on-line books. Also used in some other projects, and for
publicity materials.
Notes on use: This is the preferred format for
documents where the exact "look" is important, since we expect it
will be supported for a long time, and that if a new format replaces
it, a migration path will be available. PDF is itself a "successor"
to PostScript, and most PostScript documents can be migrated to PDF
with little difficulty.
(We don't recommend using PostScript for archival purposes.)
PDF is not recommended at this point for highly structured documents;
for those, use XML or some other format designed for structured data.
Local contact: John Mark Ockerbloom, Digital Library
Architect and Planner (ockerblo@pobox.upenn.edu)
What it is: An emerging standard
format for representing structured documents
and data.
Software using this format: Recent versions of Internet Explorer, and a large number of tools intended for programmers, document authors, and database managers.
Format specified by: The World Wide Web Consortium (see their official XML Home Page). The basic XML specification is now standardized; various formats related to XML, such as XML
query, schema, and pointer formats, are still under development.
Local use and support: Not in production use in the Library yet,
but we may use it as the basis for representing data for various
projects we are now developing. We are acquiring a toolset (DLXS)
from the University of Michigan for managing digital library repositories
that uses XML extensively.
Local contact: Delphine Khanna, Digital Projects Librarian (delphine@pobox.upenn.edu)
What it is: An older standard
format for representing structured documents
and data, that was the predecessor to XML.
Format specified by: ISO 8879, but that's not the
place for newcomers to start. For general overviews of
SGML, plus links to more information, see
SGML:
Introductions and Overviews at Oasis.
Formats defined in SGML: include HTML,
TEI,
and EAD. However, in many cases XML
versions of these formats are now available, or are under development.
Notes on use:
SGML is a more complex language than XML.
This means that writers of SGML documents have more flexibility than
writers of XML documents. Unfortunately, it also means that it can be
a lot more complicated to parse and work with SGML documents. Therefore,
SGML is gradually being supplanted by XML, a more strict form of markup
for structured documents that is also easier to parse and interpret.
Since there still are many SGML documents out there that are not
XML-compatible, we still need some SGML-enabled tools to work with them.
However, new projects should use XML-compatible formats when feasible.
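To illustrate the stricter rules, here is a small Python sketch using the standard library's `xml.etree.ElementTree` parser. The sample markup is invented. The second fragment leaves out end tags, which SGML can permit when the document type allows it, but which an XML parser always rejects.

```python
import xml.etree.ElementTree as ET

# Invented sample markup: the first fragment closes every element,
# the second omits the </item> end tags.
STRICT = "<list><item>one</item><item>two</item></list>"
LOOSE = "<list><item>one<item>two</list>"


def is_well_formed(text):
    """Return True if the text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    for label, sample in (("STRICT", STRICT), ("LOOSE", LOOSE)):
        print(label, is_well_formed(sample))
```

A check like this is one quick way to find out whether an existing SGML document would need cleanup before XML tools can read it.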
Local contact: Delphine Khanna, Digital Projects Librarian (delphine@pobox.upenn.edu)
What it is: An SGML-based format for representing text documents,
designed primarily for encoding their logical structure.
Format specified by: The Text Encoding Initiative Consortium
(see their web site)
Software using this format: includes DLXS, Panorama
from SoftQuad, and other SGML-aware tools.
Notes on use:
TEI has been used to encode text transcriptions in many academic
etext projects, including those at Virginia, Indiana, and UNC. Most
of these projects also have to provide translation to HTML, since Web
browsers typically do not support direct display of TEI documents.
The original "official" TEI is an
SGML-based format. Because of its complexity, though,
a subset known as TEI Lite was created, which is what many electronic
text projects use. Efforts are underway to make XML versions of the
TEI formats, but the XML versions are not yet official.
Local contact: Delphine Khanna, Digital Projects Librarian (delphine@pobox.upenn.edu)
What it is: A proprietary, but widely used format for word processing documents.
Software provided by: Microsoft
How we acquire it: Through a site license (possibly with
a limited number of installations). We cannot modify or redistribute it.
Local use and support: Available on most Windows and Mac
staff desktops. Supported by Systems and ISC.
Notes on use: While Word is often useful
for internal communications, the format still cannot be read by many outside
users, is not specified in a public document, and is subject to repeated
change. Therefore, documents
meant either for public consumption, or to be kept more than short-term,
should be provided in more stable and portable formats.
Local contact: Library technical support, libtech@pobox.upenn.edu
What it is: A standard format for representing images,
suitable for archival use.
Format specified by: Adobe. (They inherited it from Aldus, who
specified it after consulting with a number of imaging vendors). The
specification for the latest standard version (6.0, standardized in 1992)
can be found
in this PDF document from Adobe.
Adobe does not appear to be maintaining
a full TIFF home page, but see www.libtiff.org for pointers to documentation and free software.
Software using this format: includes most full-featured graphics
editors. Most Web browsers do not have built-in TIFF support, but instead
spin off a viewer application to display TIFF images (such as xv on Unix,
or Imaging for Windows). There are also scripts available for Web servers
to convert TIFFs to GIFs or JPEGs on the fly.
Notes on use: TIFF is a broad enough format that it accommodates
several different ways of encoding images. Some of these encodings may
involve lossy compression or a limited color palette. When using TIFFs
to archive images, one should make sure that one is not using a lossy
or limited TIFF encoding.
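One way to check an existing file is to read its header tags. The Python sketch below is a minimal example under stated assumptions: it looks only at the first image directory of a file, it uses the Compression (259) and PhotometricInterpretation (262) tag numbers from the TIFF 6.0 specification, and it treats JPEG compression and palette color as the cases to flag. The function names and warning wording are ours.

```python
import struct

TAG_COMPRESSION = 259      # TIFF 6.0 "Compression" tag
TAG_PHOTOMETRIC = 262      # TIFF 6.0 "PhotometricInterpretation" tag
JPEG_COMPRESSION = {6, 7}  # old-style and new-style JPEG, usually lossy
PALETTE_COLOR = 3          # pixels are indexes into a color map
SHORT = 3                  # TIFF field type for 16-bit unsigned values


def first_ifd_short_tags(data):
    """Return {tag: value} for single SHORT entries in the first directory."""
    byte_order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if byte_order is None:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack_from(byte_order + "HI", data, 2)
    if magic != 42:
        raise ValueError("not a TIFF file")
    (entry_count,) = struct.unpack_from(byte_order + "H", data, ifd_offset)
    tags = {}
    for i in range(entry_count):
        # Each directory entry is 12 bytes: tag, type, count, value/offset.
        pos = ifd_offset + 2 + 12 * i
        tag, field_type, count = struct.unpack_from(byte_order + "HHI", data, pos)
        if field_type == SHORT and count == 1:
            tags[tag] = struct.unpack_from(byte_order + "H", data, pos + 8)[0]
    return tags


def archival_warnings(data):
    """List reasons the first image may be unsuitable as an archival master."""
    tags = first_ifd_short_tags(data)
    warnings = []
    # When the Compression tag is absent, TIFF 6.0 defaults to 1 (none).
    if tags.get(TAG_COMPRESSION, 1) in JPEG_COMPRESSION:
        warnings.append("JPEG compression (usually lossy)")
    if tags.get(TAG_PHOTOMETRIC) == PALETTE_COLOR:
        warnings.append("palette color (limited number of colors)")
    return warnings
```

For a file on disk, one would pass in its bytes, for example `archival_warnings(open(path, "rb").read())`, where `path` is whatever file is being checked.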
Local contact: Delphine Khanna, Digital Projects Librarian (delphine@pobox.upenn.edu)
What it is: A group that specifies a popular format
for representing still images, using
a gracefully degrading compression scheme.
Formats specified by: The Joint Photographic Experts Group
(hence the acronym JPEG). See their website
for official information about the JPEG formats.
Software using JPEG formats: Most full-featured graphics
editors, and most graphical Web browsers, support JPEG's basic
image format, JFIF (which is what most people think of as JPEG). Support
may be more limited for other JPEG formats.
Notes on use:
JPEG (okay, JFIF)
is particularly useful for displaying photographs and other images on the
Web that don't use a limited color palette or sharply-defined boundaries.
It uses a compression algorithm that can be optimized either for
image quality or compactness. However, since this compression loses
information, this format should not be used for archival storage.
JPEG is working on new standards, including JBIG2 and JPEG2000,
that support lossless compression, and wavelet compression (a powerful
compression technique also used by MrSID). These standards are not
yet finalized, but may eventually become important image formats.
Local contact: Delphine Khanna, Digital Projects Librarian (delphine@pobox.upenn.edu)
What it is: A format for representing still and
(simple) animated images, used widely in Web browsers
especially for line drawings and diagrams.
Format specified by: CompuServe, last updated in 1990.
CompuServe doesn't seem to maintain a Web site on GIF, but the
specification can be found several places on the Net, including
as
a text file on the W3C site.
Software using GIF formats: Virtually all graphical
Web browsers and graphical editing programs. Some freeware does
not support GIF, due to patent concerns.
Local use and support: GIF remains the primary format
for Web page icons and images (except for photographic images)
on the local Library Web.
Notes on use:
GIFs support frame-by-frame animation, and transparent areas.
However, no more than 256 colors can appear in a single GIF, making
the format unsuitable for color photographs or other images that require
fine color gradations. The format does work well for line art and simple
icons.
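The 256-color ceiling follows from the file layout: the header stores the size of the color table in three bits, so a table holds at most 2^(7+1) = 256 entries. The Python sketch below reads that field from the first bytes of a GIF. It is a minimal example that looks only at the global color table and ignores any per-frame tables; the function name is ours.

```python
import struct


def gif_info(data):
    """Return (version, width, height, global_color_count) from GIF header bytes."""
    if data[:6] not in (b"GIF87a", b"GIF89a"):
        raise ValueError("not a GIF file")
    # Logical screen descriptor: width and height (little-endian), then a packed byte.
    width, height, packed = struct.unpack_from("<HHB", data, 6)
    has_global_table = bool(packed & 0x80)
    # The low three bits hold N; the table then has 2**(N+1) entries.
    colors = 2 ** ((packed & 0x07) + 1) if has_global_table else 0
    return data[3:6].decode("ascii"), width, height, colors
```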
Most GIFs are encoded using a compression
algorithm that is patented by Unisys. There has been some controversy
over Unisys' enforcement of the patent, which as of 1999 included a demand
for licensing
fees to be paid by Web sites that could not document
that their GIFs all came from Unisys-licensed software.
(The graphics programs that the Library purchases are licensed by Unisys.)
PNG has been invented
as a patent-free alternative to GIF, but to date has not caught on as
widely as GIF has. The patent for the compression algorithm used
by GIFs expires in June 2003.
Local contact: Mike Winkler, Web Manager (winkler4@pobox.upenn.edu)
What it is: A format for representing
highly compressed images, and a set of tools to display and manipulate them.
Format standard and software provided by: LizardTech
How we acquire it: Some of the software (like the
plugin viewers for MrSID, and a low-volume image server that serves
images to ordinary web browsers) is free;
other components (like the encoder) are sold. The format specification
is proprietary, and not published.
Local use and support: We are using the image server on
an experimental basis, and hope to make it the basis for delivery
of fine arts slide images.
Notes on use: Because MrSID is a proprietary, closed format,
and involves lossy compression,
it should not be used as an archival format for images. For the fine
arts slide project, we are using TIFF as the archival version.
Local contact: Delphine Khanna, Digital Projects Librarian (delphine@pobox.upenn.edu)
Local use and support: Used throughout the Library Web by web
developers.
Notes on use:
URLs, once announced, are often copied onto many pages, which causes
problems when they break. If local resources are referred to by URL,
one should avoid changing the URLs unless absolutely necessary. For
more persistent Web references, consider using
Handles, or other persistent identifiers like
PURLs, if they are available for the resource.
Local contact: Mike Winkler, Web Manager (winkler4@pobox.upenn.edu)
What it is: A persistent identifier and reference for electronic
documents; more stable, and less fragile, than a URL.
Format specified by: The Corporation for National Research Initiatives (see their Handle System site).
Software using this format: We have a Handle Server, provided
by CNRI, which will take Handles encoded as URLs and redirect browsers
to the actual location of the resource referred to by the Handle. Major
browsers do not directly support Handles at this time, but a plugin is
available for direct resolution.
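A Handle "encoded as a URL" is simply the Handle appended to the address of a resolving server. The Python sketch below builds such URLs. It uses the public Handle proxy at hdl.handle.net as the default resolver; the function name is ours, and any Handle or local server name passed to it in practice would come from our own naming rules, which are not yet settled.

```python
from urllib.parse import quote

PUBLIC_PROXY = "http://hdl.handle.net"


def handle_to_url(handle, resolver=PUBLIC_PROXY):
    """Turn a Handle of the form prefix/suffix into a URL on a resolving server."""
    prefix, sep, suffix = handle.partition("/")
    if not sep or not prefix or not suffix:
        raise ValueError("a Handle has the form prefix/suffix")
    # Percent-encode characters that are not safe in a URL path.
    return f"{resolver.rstrip('/')}/{quote(prefix, safe='')}/{quote(suffix)}"
```

For instance, a made-up Handle `12345/example-journal` would become `http://hdl.handle.net/12345/example-journal`. If the journal later moves, only the Handle record changes, not the links that cite it.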
Local use and support: Although we do not yet support Handles
in production use, we plan to use them first to track electronic journals,
and then later on use them as identifiers for other digital resources
we create. Rules for assigning Handles are in preparation; talk to the
local contact below for more details.
Local contact: John Mark Ockerbloom, Digital Library Architect and Planner (ockerblo@pobox.upenn.edu)
What it is: A protocol used to search databases, adopted
as a standard in many library databases.
More information to come.
Scripting and programming
General notes
on the use of scripting and programming languages:
We strongly advise that information used by scripts or programs
be maintained separately from the programs themselves, in a standard
format, and not simply embedded in the program source code. Separation
of information from tools
makes the information much easier to maintain over the long term, and
also allows the information to be reused in other contexts.
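A minimal sketch of this separation, in Python: the program below knows how to read and present opening hours, but the hours themselves live in CSV data. To keep the example self-contained, a string stands in for the contents of a separate file (a hypothetical `hours.csv`); in real use the data would be opened from disk. The library names and hours are invented.

```python
import csv
import io

# Stand-in for the contents of a separately maintained data file.
HOURS_CSV = """library,opens,closes
Library A,09:00,17:00
Library B,10:00,20:00
"""


def load_hours(stream):
    """Read rows of library,opens,closes into a dictionary keyed by library."""
    return {
        row["library"]: (row["opens"], row["closes"])
        for row in csv.DictReader(stream)
    }


def describe(hours, library):
    """Format one library's hours as a sentence."""
    opens, closes = hours[library]
    return f"{library} is open from {opens} to {closes}."


if __name__ == "__main__":
    hours = load_hours(io.StringIO(HOURS_CSV))
    for name in hours:
        print(describe(hours, name))
```

When the hours change, only the data file needs editing, and the same file could feed a Web page, a printed schedule, or another program.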
Home-grown programs can be difficult and costly to maintain. Consider
whether an existing standard program can be used in place of writing your
own program. If you do write your own, make sure you plan for
its long-term maintenance (keeping in mind that other people
may have to maintain it).
Except where noted, the Library as a whole does not officially
support any of the following languages or programs (though individual
departments might).
What it is: A scripting language designed
for use on Web pages or servers.
Software using this language: Major graphical Web browsers will
run JavaScript programs, unless users have turned off JavaScript features.
Language and tools provided by: Netscape (see
JavaScript Developer Central). The language has been submitted to a standards body
for further development. Microsoft has a competing product called JScript
which also implements the basic JavaScript interpreter, but includes
features that may not work on non-Microsoft browsers.
How we acquire the software: The JavaScript interpreter
is built into Netscape and Internet Explorer.
Local use and support: Pages that use cascading stylesheets
may depend on JavaScript for optimal appearance.
Some Library web pages also have used JavaScript, but some of these have
since dropped it in favor of server-side CGI
scripts (which don't require
any special browser configuration.)
Despite the name, JavaScript is a fundamentally different language
from Java. It does, however, share some language
constructs, and can be used to invoke Java programs.
Some of our web users don't run JavaScript, either because
of physical or computer limitations, or because of security concerns
(which still come up periodically). Web developers should try to accommodate
non-JavaScript users, and not require the use of JavaScript when
alternatives (using regular HTML
or CGI scripts) are feasible.
Local contact: Mike Winkler, Web Manager (winkler4@pobox.upenn.edu)
What it is: An object-oriented programming language designed
to be secure, and portable across different operating systems.
Software using this language: Major graphical Web browsers will
run Java programs, unless users have turned off Java features.
Java programs can also be run standalone on any
machine that has a Java Virtual Machine.
Language and tools provided by: Sun Microsystems (see the
java.sun.com Web site), with additional
tools provided by various third parties.
How we acquire the software: The main Java tools for Solaris
and Windows NT are provided free of charge from Sun, from the site above.
Apple provides a Macintosh version. The licenses may attempt to
limit some rights of modification, redistribution, and commentary (!).
Local use and support: Some digital library software is implemented
in Java, including our Handle server. We don't
provide official support for Java, but if you have any questions, you can
talk to the local contact below.
There is an ever-growing class library for Java that can be used
in local programs. See
java.sun.com for details.
Sun has canceled earlier plans of turning over control of Java to a
standards body. Standards controlled by a single company can carry
a higher risk of abrupt changes than those controlled by a standards body.
Local contact: John Mark Ockerbloom, Digital Library Architect and Planner (ockerblo@pobox.upenn.edu)
What it is: An interpreted programming language often used
for Web scripts, text processing, and rapid prototyping.
Language and tools provided by: Larry Wall and the Perl Mongers (see the Perl Mongers Web site).
How we acquire the software: The main Perl tools are
released as open-source software; we get it for free, and can modify or
redistribute it (though we wouldn't want to modify the language interpreter).
Many Perl
library modules are released under the same terms as Perl itself.
Local use and support: Various people in the Systems group have
used Perl for Web server scripts and rapid prototypes. We don't
provide official support for Perl, but if you have any questions, you can
talk to the local contact below.
Local documentation: This old lesson plan
still has some useful information for people learning Perl.
Related technologies: When invoked by Web servers, Perl
scripts are called via the CGI interface.
On the plus side, Perl can be used to create rapid prototypes
of programs very quickly, and is especially well-suited for programs
that involve lots of manipulations of text strings. Well-written Perl
programs can often run unmodified on all major operating systems.
Perl is easy to learn for those already familiar with C and Unix programming,
less easy for others.
There's a large community of developers of open-source Perl software
(see below).
On the minus side, it is easy to write Perl programs that
are completely unreadable and unmaintainable, even by the original
author. The module, object, and documentation features of Perl 5 make
it possible to write and maintain larger programs than earlier versions of
Perl allowed, but authors still need to take pains to ensure that
their programs are written in a style that allows maintenance and reuse.
There is a large and growing collection of Perl modules
at CPAN. For many
functions, you can download and use one of these modules, instead of
trying to write your own code to do the same thing. You can also contribute
your own modules or improvements.
Local contact: John Mark Ockerbloom, Digital Library Architect and Planner (ockerblo@pobox.upenn.edu)
What it is: A versatile and efficient, but low-level,
programming language.
Language defined by: ANSI. A standard reference for
this language is The C Programming Language by Kernighan and
Ritchie.
Tools provided by: A variety of suppliers. The
Free Software Foundation provides
a free compiler (gcc) and debugger (gdb) for C that are widely used.
Commercial compilers and environments are also available.
How we acquire the software: There is no official support
for this language in the Library, but the FSF tools mentioned above are
open source and
can be downloaded freely from their web site.
Local use: C is used in some Systems projects
where efficiency or access to operating system-level structures is important.
Notes on use:
Many C environments introduce extra routines and constructs that might
not be supported on all platforms. However, the definitions and standard
library routines given in Kernighan and Ritchie (see above) should be
supported on all ANSI compilers. (Gcc is ANSI-compliant; the default
compiler on some systems, including Solaris, is not.)
C's low-level, close-to-the-machine programming model
is both its great strength
and its great weakness. Using C, you can write code that runs faster
and leaner
than virtually any other language, and that uses operating system features
not available in higher-level languages. On the other hand, using
low-level operating system features may lead to programs that are
not easily portable. C's do-it-yourself approach to memory management
makes it easy to write programs that crash by referencing memory that
hasn't been properly allocated, or programs that get increasingly bloated
as they run, requesting additional memory but not freeing memory that's
no longer needed.
It may require complex, time consuming programming to do proper memory
management, or support multiple threads of control, or do complex
expression searching or exception handling-- all features that are
built-in for other languages but not for C.
Local contact: John Mark Ockerbloom, Digital Library Architect and Planner (ockerblo@pobox.upenn.edu)
What it is: An interface that Web servers use to
invoke server scripts.
Software using this interface: All major Web servers can
run CGI scripts. The scripts themselves can be written in any language
that is supported on the machine on which the Web server resides.
CGI and other "server-side" scripts, unlike scripts that
run inside a user's browser, can typically be used by any Web browser.
Some languages (Perl is one) have prewritten modules you can use
for handling input and output for CGI scripts, so that you don't have to
write your own.
CGI scripts, if not very carefully written, can be exploited by
hackers to gain unauthorized access to our local computing resources.
See this section
of the World Wide Web Security FAQ for details. Contact our
local Web manager if you have any doubts about the safety of a CGI script
you plan to write or install.
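For orientation, here is a minimal sketch of a CGI script, written in Python. It assumes the server passes form input in the `QUERY_STRING` environment variable, as the CGI interface specifies; the parameter `name` and the greeting are invented. It shows one basic precaution only (limiting and escaping user input before echoing it) and does not cover every risk discussed in the FAQ, such as passing input to a shell.

```python
import html
import os
from urllib.parse import parse_qs


def build_response(query_string):
    """Return a complete CGI response: header, blank line, then an HTML body."""
    params = parse_qs(query_string)
    # Take the first "name" value, cap its length, and escape it before echoing.
    name = params.get("name", ["visitor"])[0][:50]
    body = f"<html><body><p>Hello, {html.escape(name)}!</p></body></html>"
    return "Content-Type: text/html\n\n" + body


if __name__ == "__main__":
    # The Web server supplies the part of the URL after "?" in QUERY_STRING.
    print(build_response(os.environ.get("QUERY_STRING", "")), end="")
```

Keeping the logic in a function that takes the query string as an argument also makes the script easy to test without a Web server.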
Local contact: Mike Winkler, Web Manager (winkler4@pobox.upenn.edu)
Because Web browsers and servers are so ubiquitous, HTTP has
become the de-facto standard protocol used to request operations remotely
using a Web browser.
The exact details of HTTP are invisible to most
users of the Web, and authors of Web documents. However, if one is
writing CGI scripts, or writing one's own Web-enabled
servers or clients, it may be important to know how HTTP works.
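At its simplest, an HTTP exchange is a few lines of text: the client sends a request line and headers, and the server answers with a status line, headers, and the document. The Python sketch below builds a minimal request and picks apart a status line. It does no networking, the function names are ours, and any host name passed in is up to the caller.

```python
def build_get_request(host, path="/"):
    """Return the text of a minimal HTTP/1.0 GET request."""
    # A blank line (the final CRLF pair) marks the end of the request headers.
    return f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n"


def parse_status_line(line):
    """Split a status line such as 'HTTP/1.0 200 OK' into version, code, reason."""
    parts = line.strip().split(" ", 2)
    if len(parts) < 2:
        raise ValueError("malformed status line")
    reason = parts[2] if len(parts) > 2 else ""
    return parts[0], int(parts[1]), reason
```

A CGI script rarely builds these lines itself, but knowing the shape of the exchange helps when debugging headers or status codes.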
Local contact: Mike Winkler, Web Manager (winkler4@pobox.upenn.edu)
What it is: A system (including protocol and software) for
managing diverse data formats and converting between them.
Software provided by: John Mark Ockerbloom wrote the
basic tools; CMU also has a conversion service on the Web that uses TOM.
How we acquire it: The internal "broker" software is open-source
(freely available, modifiable, and distributable). We don't have much in
the way of user interfaces for it yet (CMU's conversion service software
is not available at this time).
Local use and support: We've received a grant from the
Mellon Foundation to develop TOM applications for digital preservation
and courseware in 2003 and 2004.