From: bigapple (红富士), Board: NTM
Subject: XML, Java, and the future of the Web
Posted at: 哈工大紫丁香 (Mon Aug 7 09:16:39 2000), forwarded

XML, Java, and the future of the Web

Jon Bosak, Sun Microsystems
Last revised 1997.03.10

Introduction
The extraordinary growth of the World Wide Web has been fueled by the ability
it gives authors to easily and cheaply distribute electronic documents to an
international audience. As Web documents have become larger and more complex,
however, Web content providers have begun to experience the limitations of a
medium that does not provide the extensibility, structure, and data checking
needed for large-scale commercial publishing. The ability of Java applets to
embed powerful data manipulation capabilities in Web clients makes even
clearer the limitations of current methods for the transmittal of document
data.

To address the requirements of commercial Web publishing and enable the
further expansion of Web technology into new domains of distributed document
processing, the World Wide Web Consortium has developed an Extensible Markup
Language (XML) for applications that require functionality beyond the current
Hypertext Markup Language (HTML). This paper [0] describes the XML effort and
discusses new kinds of Java-based Web applications made possible by XML.

Background: HTML and SGML
Most documents on the Web are stored and transmitted in HTML. HTML is a
simple language well suited for hypertext, multimedia, and the display of
small and reasonably simple documents. HTML is based on SGML (Standard
Generalized Markup Language, ISO 8879), a standard system for defining and
using document formats.

SGML allows documents to describe their own grammar -- that is, to specify
the tag set used in the document and the structural relationships that those
tags represent. HTML applications are applications that hardwire a small set
of tags in conformance with a single SGML specification. Freezing a small set
of tags allows users to leave the language specification out of the document
and makes it much easier to build applications, but this ease comes at the
cost of severely limiting HTML in several important respects, chief among
which are extensibility, structure, and validation.

- Extensibility. HTML does not allow users to specify their own tags or
  attributes in order to parameterize or otherwise semantically qualify their
  data.
- Structure. HTML does not support the specification of deep structures
  needed to represent database schemas or object-oriented hierarchies.
- Validation. HTML does not support the kind of language specification that
  allows consuming applications to check data for structural validity on
  importation.

In contrast to HTML stands generic SGML. A generic SGML application is one
that supports SGML language specifications of arbitrary complexity and makes
possible the qualities of extensibility, structure, and validation missing
from HTML. SGML makes it possible to define your own formats for your own
documents, to handle large and complex documents, and to manage large
information repositories. However, full SGML contains many optional features
that are not needed for Web applications and has proven to have a
cost/benefit ratio unattractive to current vendors of Web browsers.

The XML effort
The World Wide Web Consortium (W3C) has created an SGML Working Group to
build a set of specifications to make it easy and straightforward to use the
beneficial features of SGML on the Web. See the W3C SGML Activity page [1]
for the current status of this effort. The goal of the W3C SGML activity is
to enable the delivery of self-describing data structures of arbitrary depth
and complexity to applications that require such structures.

The first phase of this effort is the specification of a simplified subset of
SGML specially designed for Web applications. This subset, called XML
(Extensible Markup Language), retains the key SGML advantages of
extensibility, structure, and validation in a language that is designed to be
vastly easier to learn, use, and implement than full SGML.

XML differs from HTML in three major respects:

- Information providers can define new tag and attribute names at will.
- Document structures can be nested to any level of complexity.
- Any XML document can contain an optional description of its grammar for use
  by applications that need to perform structural validation.

XML has been designed for maximum expressive power, maximum teachability, and
maximum ease of implementation. The language is not backward-compatible with
existing HTML documents, but documents conforming to the W3C HTML 3.2
specification can easily be converted to XML, as can generic SGML documents
and documents generated from databases.
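
The three differences listed above can be made concrete with a small sketch.
It is an illustration only, not something this paper describes: it assumes
the JAXP parser API bundled with present-day Java (which postdates this
paper), and the <patient>, <name>, <allergies>, and <substance> tags and the
grammar in DTD are hypothetical names invented for the example. The document
defines its own tags, nests them, and carries an optional grammar that the
receiving application may choose to check.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

final class GrammarCheck {
    // Hypothetical grammar carried inside the document itself:
    // a patient has a name followed by a list of zero or more allergy substances.
    static final String DTD =
        "<!DOCTYPE patient [\n"
      + "  <!ELEMENT patient (name, allergies)>\n"
      + "  <!ELEMENT name (#PCDATA)>\n"
      + "  <!ELEMENT allergies (substance*)>\n"
      + "  <!ELEMENT substance (#PCDATA)>\n"
      + "]>\n";

    /** Returns true if the document is well-formed and matches the grammar it carries. */
    static boolean isValid(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // validation is optional; this switches it on
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Validity errors are only reported by default; rethrow so they fail the parse.
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) throws SAXParseException {
                    throw e;
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String good = DTD
            + "<patient><name>A. Example</name>"
            + "<allergies><substance>penicillin</substance></allergies></patient>";
        String bad = DTD + "<patient><allergies/></patient>"; // required <name> is missing
        System.out.println(isValid(good)); // true
        System.out.println(isValid(bad));  // false
    }
}
```

A consumer that does not need the check can leave setValidating at its
default of false and parse the same document unchanged.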

An initial working draft for XML 1.0 [2] has been released for public
discussion. A complete specification that includes methods for associating
hypertext linking and stylesheet mechanisms with XML documents is scheduled
for release at the Sixth World Wide Web Conference in April, 1997.

Web applications of XML
The applications that will drive the acceptance of XML are those that cannot
be accomplished within the limitations of HTML. These applications can be
divided into four broad categories:

1. Applications that require the Web client to mediate between two or more
   heterogeneous databases.
2. Applications that attempt to distribute a significant proportion of the
   processing load from the Web server to the Web client.
3. Applications that require the Web client to present different views of the
   same data to different users.
4. Applications in which intelligent Web agents attempt to tailor information
   discovery to the needs of individual users.

The alternative to XML for these applications is proprietary code embedded as
"script elements" in HTML documents and delivered in conjunction with
proprietary browser plug-ins or Java applets. XML derives from a philosophy
that data belongs to its creators and that content providers are best served
by a data format that does not bind them to particular script languages,
authoring tools, and delivery engines but provides a standardized,
vendor-independent, level playing field upon which different authoring and
delivery tools may freely compete.

Database interchange: the universal hub
A paradigmatic example of this first category of XML applications is the
information tracking system for a home health care agency.

Home health care is a major component of America's multibillion-dollar
medical industry that continues to increase in importance as the health care
burden is shifted from hospitals to home care settings. Information
management is critical to this industry in order to meet the record-keeping
requirements of the federal agencies and health maintenance organizations
that pay for patient care.

The typical patient entering a home health care agency is represented to the
information system by a large collection of paper-based historical materials
in the form of patient medical histories and billing data from a variety of
doctors, hospitals, pharmacies, and insurance companies. The biggest task in
getting the patient into the system is the manual entry of this material into
the agency's database.

The coming of the Web has given the medical informatics community the hope
that an electronic means can be found to alleviate this burden.
Unfortunately, existing Web applications represent fundamentally insufficient
models for an adequate solution. Hospitals have begun to offer the agencies a
solution that goes something like this:

1. Log into the hospital's Web site.
2. Become an authorized user.
3. Access the patient's medical records using a Web browser.
4. Print out the records from the browser.
5. Manually key in the data from the printouts.

The knowledgeable reader may smile at this "solution," but in fact this is
not a joke; this is an actual proposal from a large American hospital known
for its early adoption of advanced medical information systems.

A slightly more sophisticated version of this "solution" envisions the
operator reading the patient data from the Web browser and keying it directly
into the agency's online forms-based interface in a separate window instead
of making a printout first. The only difference between this version and the
previous one is that it saves the paper that would have been needed for the
printout. It does nothing to address the root of the problem. A real solution
would look more like this:

1. Log into the hospital's Web site.
2. Become an authorized user.
3. Access the patient's medical records in a Web-based interface that
   represents the records for that patient with a folder icon.
4. Drag the folder from the Web application over to the internal database
   application.
5. Drop it into the database.

However, this solution is not possible within the limitations of HTML, for
three reasons.

1. The HTML tag set is too limited to represent or differentiate between the
   multitude of database fields in the mixture of documents making up the
   patient's medical history.
2. HTML is incapable of representing the variety of structures in those
   documents.
3. HTML lacks any mechanism for checking the data for structural validity
   before the receiving application attempts to import it into the target
   database.

One technically feasible way to implement seamless interchange of patient
care records is simply to require all hospitals and health care agencies to
use a single standard system dictated by the government (such an approach has
actually been suggested). In an environment where hospitals are going out of
business on a daily basis and many health care agencies are in deep financial
difficulty, however, a scheme that would require them to replace their
existing heterogeneous systems with a single new system en masse is hardly
practical.

The other way to enable interchange between heterogeneous systems is to adopt
a single industry-wide interchange format that serves as the single output
format for all exporting systems and the single input format for all
importing systems. This is, in fact, the purpose for which SGML was initially
designed, and XML simply carries on this tradition.

A number of industries, including the aerospace, automotive,
telecommunications, and computer software industries, have been using hub
languages to perform data interchange for years, and by this time the process
is well understood. Typically, the major players in an industry form a
standards consortium tasked with defining a Document Type Definition, which
is the way in which the tag set and grammar of a markup language are defined.
This DTD can then be sent with documents that have been marked up in the
industry standard language using off-the-shelf editing tools, and any
standard application on the receiving end can validate and process them.

The XML solution is system-independent, vendor-independent, and proven by
over a decade of SGML implementation experience. XML merely extends this
proven approach to document interchange over the Web. Interestingly, the same
day on which the first XML 1.0 draft was released also saw the formal
announcement of an SGML initiative within HL7, the standards organization for
health care IS vendors, to develop a Health Care Markup Language designed to
solve exactly the kind of problem described in this example.

Previous vertical-industry efforts have shown that capturing data in a rich
markup often has benefits beyond the immediate requirements of data exchange.
In a well-designed standardized patient data system, for example, specific
information originally gathered in the course of a routine physical exam and
tagged <allergies>, <drug-reactions>, and so on would instantly be available
to alert the staff of an emergency room that an unconscious patient from a
distant city was allergic to penicillin. The ability of XML to define tags
specific to an area of application is critical to this scenario, because the
otherwise unqualified word "penicillin" in the thousands of pages of a
patient's entire medical history could not trigger the recognition that the
same word inside an <allergies> element could trigger.
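
The difference between an unqualified word and a tagged one can be sketched
in a few lines. This is a hedged illustration rather than part of any real
patient data standard: the <record>, <visit>, and <notes> tags are
hypothetical, only <allergies> comes from the text above, and the parsing
uses the JAXP API bundled with present-day Java.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class AllergyAlert {
    /** True only if the substance is named inside an <allergies> element. */
    static boolean isAllergicTo(String recordXml, String substance) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(recordXml)));
            // Look only at text that the record's author tagged as an allergy.
            NodeList allergies = doc.getElementsByTagName("allergies");
            String wanted = substance.toLowerCase();
            for (int i = 0; i < allergies.getLength(); i++) {
                String text = allergies.item(i).getTextContent().toLowerCase();
                if (text.contains(wanted)) {
                    return true;
                }
            }
            return false;
        } catch (Exception e) {
            throw new IllegalArgumentException("record is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // The word appears in free text here, but not as a recorded allergy.
        String mentionOnly =
            "<record><visit><notes>Sister was prescribed penicillin.</notes></visit>"
          + "<allergies>latex</allergies></record>";
        String tagged =
            "<record><visit><notes>Routine physical exam.</notes></visit>"
          + "<allergies>Penicillin</allergies></record>";
        System.out.println(isAllergicTo(mentionOnly, "penicillin")); // false
        System.out.println(isAllergicTo(tagged, "penicillin"));      // true
    }
}
```

A plain text search would flag both records; the tag is what lets the alert
fire for the second one only.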

The health care example is relevant not only because of the scope of the
problem and the enormous sums of money involved but also because it is
paradigmatic of a very wide range of future Web applications -- any in which
Web clients (or Java applications running on those clients) are expected to
mediate the lossless exchange of complex data between systems that use
different forms of data representation in a way that can be standardized
across an industry or other interest group. Some random examples of such
applications are:

- Legal publishing
- The government drug approval process
- Collaborative CAD/CAM efforts
- Collaborative calendar management across different systems
- Any corporate network application that works across databases, especially
  where policies must be enforced: purchase orders, expense requests, etc.
- Exchange of information between players in any broker-organized business:
  insurance, securities, banking, etc.

Distributed processing: giving Java something to do
A paradigmatic example of this second category of XML applications is the
data delivery system designed by the semiconductor industry.

Each major semiconductor manufacturer maintains several terabytes of
technical data on all of the ICs that it produces. To enable interchange of
this data, an industry consortium (the Pinnacles Group) was formed several
years ago by Intel, National Semiconductor, Philips, Texas Instruments, and
Hitachi to design an industry-specific SGML markup language. The consortium
finished that specification in 1995, and its member companies are now well
into the implementation phase of the process.

One might think that the rise in popularity of HTML would cause the Pinnacles
members to reconsider their decision, but in fact the limitations of HTML
have convinced them that their original strategy was the correct one. Their
initial idea was that the richly parameterized data stream made possible by
the industry-specific SGML markup would enable intelligent applications not
merely to display semiconductor data sheets as readable documents but
actually to drive design processes. It is now recognized that this approach
is a perfect fit with the concept of distributed Java applets, and the vision
of the near future is one in which engineers can access a manufacturer's Web
site and download not only viewable data on particular integrated circuits
but also a Java applet that allows them to model those circuits in various
combinations.

The semiconductor application is a good demonstration of the advantages of XML
because:

- It requires industry-specific markup that cannot be implemented within the
  confines of the fixed HTML tag set.
- It requires that the data representation be platform- and
  vendor-independent so that data from a variety of sources can be used to
  drive a variety of distributed applications (some of which may be provided
  by third parties, generating a subindustry of providers of tools that can
  work with the standardized data stream).
- Its utility rests ultimately in the fact that a computation-intensive
  process (modeling circuits for hours at a time) that would otherwise entail
  an enormous, extended resource hit on the server has been changed into a
  brief interaction with the server followed by an extended interaction with
  the user's own Web client. This aspect has been summed up in the slogan
  "XML gives Java something to do."

Note that validation, while sometimes important, does not always play the
crucial role in this category of applications that it does in applications
where data must be checked for structural integrity before entering a
database. To make processing as efficient as possible, XML has been designed
so that validation is optional in applications where it is not needed.

As with the health-care example, the semiconductor application is notable not
merely for the sheer size of the market it represents but also because it is
paradigmatic of an enormous range of future Java-based Web applications --
virtually any application in which standardized data is expected to be
manipulated in interesting ways on the client. Perhaps the most obvious
examples of such applications are the following:

- Design applications where the designer would otherwise use server cycles to
  consider various alternatives: electronics, engineering, architecture, menu
  planning, etc.
- Scheduling applications where a customer would otherwise use server cycles
  to entertain various possibilities: airlines, trains, buses, and subways;
  restaurants, movies, plays, and concerts. This is what Easy Saabre and
  Ticketron will look like a few years from now as the economies of
  distributed Java-based processing become evident.
- Commercial applications that allow consumers to explore alternatives by
  supplying different shopping criteria: real estate, automobiles,
  appliances, etc.
- The entire spectrum of educational applications, a small subset of which
  are the ones we call "online help".
- The entire spectrum of customer-support applications, ranging from
  lawn-mower maintenance through technical support for computers.

A harbinger of applications to come in the last category is the Solution
Exchange Standard, an SGML markup language announced last June by a
consortium of over 60 hardware, software, and communications companies to
facilitate the exchange of technical support information among vendors,
system integrators, and corporate help desks. In the words of the
announcement:

    The standard has been designed to be flexible. It is independent of any
    platform, vendor or application, so it can be used to exchange solution
    information without regard to the system it is coming from or going to.
    [...] Additionally, the standard has been designed to have a long
    lifetime. SGML offers room for growth and extensibility, so the standard
    can easily accommodate rapidly changing support environments.

Such applications, which the XML subset is specifically designed to address,
will grow in importance as consumers come to expect interoperability among
their data-manipulating applets and information providers confront the
realities of trying to support computation-intensive tasks directly on their
Web servers.

View selection: letting the user decide
A third variety of XML application is one in which users may wish to switch
between different views of the data without requiring that the data be
downloaded again in a different form from the Web server.

One early application in this category will be dynamic tables of contents. It
is possible now, using Web servers built on object-oriented databases, to
present the user with a table of contents into a large collection of data
that can be expanded with a mouse click to "open up" a portion of the TOC and
reveal more detailed levels of the document structure. Dynamic TOCs of this
kind can be generated at run time directly from the hierarchical structure of
the document. Unfortunately, the Web latency built into every expansion or
contraction of the TOC makes this process sluggish in many user environments.
A much better solution is to download the entire structured TOC to the client
rather than just individual server-generated views of the TOC. Then the user
can expand, contract, and move about in the TOC supported by a much faster
process running directly on the client.

A group at Sun actually implemented a form of this solution as part of a
Java-based HTML help browser, but the limitations of HTML required the team
to come up with a couple of clever workarounds. In this application, a TOC
was constructed by hand (the lack of structure in ordinary HTML makes it
impossible to reliably generate a TOC directly from the document) using
nonstandard tags invented for the purpose, and then the TOC piece was wrapped
in a comment within an HTML page to hide the nonstandard markup from Web
browsers. A Java applet downloaded with the HTML document interpreted the
hidden markup and provided the client-based TOC behavior.

In practice, this application worked very well and testified both to the
ingenuity of its designers and to the validity of the basic concept. But in
an XML environment, neither the manual creation of the TOC nor its
concealment would have been necessary. Instead, standard XML editors would
have been used to create structured content from which a structured TOC could
be generated at run time and downloaded to browsers that would automatically
create and display the TOC using either a downloaded Java applet or a
standard set of JavaHelp class libraries.
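
Generating a TOC from document structure, as described above, amounts to a
walk over the element tree. The following is a minimal sketch under stated
assumptions, not the Sun team's implementation: the <manual> and <section>
elements and the title attribute are hypothetical, and the parsing relies on
the JAXP API bundled with present-day Java.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class TocBuilder {
    /** Returns one line per <section>, indented two spaces per nesting level. */
    static String toc(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
            StringBuilder out = new StringBuilder();
            walk(root, 0, out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("document is not well-formed XML", e);
        }
    }

    // Depth-first walk: the nesting of <section> elements is the TOC hierarchy.
    private static void walk(Element element, int depth, StringBuilder out) {
        int childDepth = depth;
        if (element.getTagName().equals("section")) {
            for (int i = 0; i < depth; i++) {
                out.append("  ");
            }
            out.append(element.getAttribute("title")).append('\n');
            childDepth = depth + 1;
        }
        for (Node child = element.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child instanceof Element) {
                walk((Element) child, childDepth, out);
            }
        }
    }

    public static void main(String[] args) {
        String manual =
            "<manual>"
          + "<section title=\"Installation\">"
          + "<section title=\"Sparc\"/><section title=\"x86\"/>"
          + "</section>"
          + "<section title=\"Administration\"/>"
          + "</manual>";
        System.out.print(toc(manual));
    }
}
```

Because the whole tree is already on the client, expanding or collapsing a
branch needs no further round trip to the server.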

The ability to capture and transmit semantic and structural data made
possible by XML greatly expands the range of possibilities for client-side
manipulation of the way data appears to the user. For example:

- A technical manual that covers both the Sparc and x86 versions of the
  Solaris operating system can be made to appear like a manual for Sparc
  only, or a manual for x86 only, just by clicking a preferences switch.
- An installation sheet that carries warnings in multiple languages can be
  made to show just the ones in the language selected by the user.
- A document containing many annotations can be switched from a mode that
  shows only the text, to a mode that shows only the annotations, to a mode
  that shows both, just by making a menu selection.
- A phone book sorted by last name can instantly be changed into a phone book
  sorted by first name.

This list only hints at the possible uses that creative Web designers will
find for richly structured data delivered in a standardized way to Web
clients.
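
The phone book case above can be sketched as follows. This is an
illustration under assumptions, not a prescribed design: the <phonebook>,
<entry>, <first>, and <last> tags are hypothetical, and the JAXP API used
for parsing is the one bundled with present-day Java. The data is parsed
once and each view is then computed on the client from the same tree.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class PhoneBookView {
    /** Parses the downloaded phone book once; every view reuses this tree. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("phone book is not well-formed XML", e);
        }
    }

    // Text of the first child element with the given tag name.
    static String field(Element entry, String tag) {
        return entry.getElementsByTagName(tag).item(0).getTextContent();
    }

    /** Returns "first last" lines ordered by the chosen field ("first" or "last"). */
    static List<String> view(Document book, String sortField) {
        NodeList nodes = book.getElementsByTagName("entry");
        List<Element> entries = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            entries.add((Element) nodes.item(i));
        }
        entries.sort(Comparator.comparing((Element e) -> field(e, sortField)));
        List<String> lines = new ArrayList<>();
        for (Element e : entries) {
            lines.add(field(e, "first") + " " + field(e, "last"));
        }
        return lines;
    }

    public static void main(String[] args) {
        Document book = parse(
            "<phonebook>"
          + "<entry><first>Alice</first><last>Zhang</last></entry>"
          + "<entry><first>Bob</first><last>Adams</last></entry>"
          + "</phonebook>");
        System.out.println(view(book, "last"));  // [Bob Adams, Alice Zhang]
        System.out.println(view(book, "first")); // [Alice Zhang, Bob Adams]
    }
}
```

Switching views is a second call to view on the document already in memory,
so nothing is downloaded again.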

Web agents: data that knows about me
A future domain for XML applications will arise when intelligent Web agents
begin to make larger demands for structured data than can easily be conveyed
by HTML. Perhaps the earliest applications in this category will be those in
which user preferences must be represented in a standard way to mass media
providers. The key requirements for such applications have been summed up by
Matthew Fuchs of Disney Imagineering: "Information needs to know about
itself, and information needs to know about me."

Consider a personalized TV guide for the fabled 500-channel cable TV system.
A personalized TV guide that works across the entire spectrum of possible
providers requires not only that the user's preferences and other
characteristics (educational level, interest, profession, age, visual acuity)
be specified in a standard, vendor-independent manner -- obviously a job for
an industry-standard markup system -- but also that the programs themselves
be described in a way that allows agents to intelligently select the ones
most likely to be of interest to the user. This second requirement can be met
only by a standardized system that uses many specialized tags to convey
specific attributes of a particular program offering (subject category,
audience category, leading actors, length, date made, critical rating,
specialized content, language, etc.). Exactly the same requirements would
apply to customized newspapers and many other applications in which
information selection is tailored to the individual user.
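
A toy version of such an agent shows why both sides need specialized tags.
Everything in this sketch is an assumption made for illustration: the
<guide>, <program>, <preferences>, <subject>, and <language> tags are
hypothetical rather than taken from any standard, the matching rule is
deliberately simple, and the parsing uses the JAXP API bundled with
present-day Java.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class GuideAgent {
    static Element parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    // Collects the text of every descendant element with the given tag name.
    static Set<String> values(Element parent, String tag) {
        Set<String> out = new HashSet<>();
        NodeList nodes = parent.getElementsByTagName(tag);
        for (int i = 0; i < nodes.getLength(); i++) {
            out.add(nodes.item(i).getTextContent().trim());
        }
        return out;
    }

    /** Titles of programs that share a subject and a language with the user's preferences. */
    static List<String> select(String guideXml, String prefsXml) {
        Element prefs = parse(prefsXml);
        Set<String> subjects = values(prefs, "subject");
        Set<String> languages = values(prefs, "language");
        List<String> picks = new ArrayList<>();
        NodeList programs = parse(guideXml).getElementsByTagName("program");
        for (int i = 0; i < programs.getLength(); i++) {
            Element program = (Element) programs.item(i);
            boolean subjectOk = !Collections.disjoint(values(program, "subject"), subjects);
            boolean languageOk = !Collections.disjoint(values(program, "language"), languages);
            if (subjectOk && languageOk) {
                picks.add(program.getAttribute("title"));
            }
        }
        return picks;
    }

    public static void main(String[] args) {
        String guide =
            "<guide>"
          + "<program title=\"Volcanoes\"><subject>science</subject><language>en</language></program>"
          + "<program title=\"Le Journal\"><subject>news</subject><language>fr</language></program>"
          + "<program title=\"Late Movie\"><subject>drama</subject><language>en</language></program>"
          + "</guide>";
        String prefs =
            "<preferences><subject>science</subject><subject>news</subject>"
          + "<language>en</language></preferences>";
        System.out.println(select(guide, prefs)); // [Volcanoes]
    }
}
```

The agent works across providers only if they all describe programs with
the same tags, which is the standardization argument made above.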

While such applications still lie over the horizon, it is obvious that they
will play an increasingly important role in our lives and that their
implementation will require XML-like data in order to function interoperably
and thereby allow intelligent Web agents to compete effectively in an open
market.

Advanced linking and stylesheet mechanisms
Outside XML as such, but an integral part of the W3C SGML effort, are
powerful linking and stylesheet mechanisms that go beyond current HTML-based
methods just as XML goes beyond HTML.

Linking
Despite its name and all of the publicity that has surrounded HTML, this
so-called "hypertext markup language" actually implements just a tiny amount
of the functionality that has historically been associated with the concept
of hypertext systems. Only the simplest form of linking is supported --
unidirectional links to hardcoded locations. This is a far cry from the
systems that were built and proven during the 1970s and 1980s.

In a true hypertext system of the kind envisioned for the XML effort, there
will be standardized syntax for all of the classic hypertext linking
mechanisms:

- Location-independent naming
- Bidirectional links
- Links that can be specified and managed outside of documents to which they
  apply
- N-ary hyperlinks (e.g., rings, multiple windows)
- Aggregate links (multiple sources)
- Transclusion (the link target document appears to be part of the link
  source document)
- Attributes on links (link types)

The first draft of a specification for basic standardized hypertext
mechanisms to be used in conjunction with XML is scheduled for release at the
Sixth World Wide Web Conference in April, 1997.

Stylesheets
The current CSS (cascading style sheets) effort provides a style mechanism
well suited to the relatively low-level demands of HTML but incapable of
supporting the greatly expanded range of rendering techniques made possible
by extensible structured markup. The counterpart to XML is a stylesheet
programming language that is:

- Freely extensible so that stylesheet designers can define an unlimited
  number of treatments for an unlimited variety of tags.
- Turing-complete so that stylesheet designers can arbitrarily extend the
  available procedures.
- Based on a standard syntax to minimize the learning curve.
- Able to address the entire tree structure of an XML document in structural
  terms, so that context relationships between elements in a document can be
  expressed to any level of complexity.
- Completely internationalized so that left-to-right, right-to-left, and
  top-to-bottom scripts can all be dealt with, even if mixed in a single
  document.
- Provided with a sophisticated rendering model that allows the specification
  of professional page layout features such as multiple column sets, rotated
  text areas, and float zones.
- Defined in a way that allows partial rendering in order to enable efficient
  delivery of documents over the Web.

Such a language already exists in a new international standard called the
Document Style Semantics and Specification Language (DSSSL, ISO/IEC 10179).
Published in April, 1996, DSSSL is the stylesheet language of the future for
XML documents. An initial specification of a DSSSL subset [3] for use with
XML applications has already been published. This specification will be
further developed as part of the XML activity.

Conclusion
HTML functions well as a markup for the publication of simple documents and
as a transportation envelope for downloadable scripts. However, the need to
support the much greater information requirements of standardized Java
applications will necessitate the development of a standard, extensible,
structured language and similarly expanded linking and stylesheet mechanisms.
The W3C SGML effort is actively developing a set of specifications that will
allow these objectives to be met within an open standards environment.

Acknowledgements
The author would like to thank his colleagues in the Davenport Group for
early contributions to the beginnings of this document. The example
applications were clarified and expanded with the help of participants in the
workshop "Internet Applications of SGML and DSSSL" held at the GCA
Information and Technology Week in Seattle on August 23, 1996. Special thanks
are due to Tim Bray, Kurt Conrad, Steve DeRose, Matt Fuchs, and Murray
Maloney for their outstanding contributions to the workshop.

Production note
This paper was written in HTML 3.2 and formatted by the Jade DSSSL engine [4]
for printout. The section numbers, headers, footers, and Table of Contents
seen in the printed version are not part of the HTML source [5] but were
generated automatically as specified by a DSSSL stylesheet [6].

References
[0] http://sunsite.unc.edu/pub/sun-info/standards/xml/why/xmlapps.ps.zip
[1] http://www.w3.org/pub/WWW/MarkUp/SGML/Activity
[2] http://www.w3.org/pub/WWW/TR/WD-xml-961114.html
[3] http://sunsite.unc.edu/pub/sun-info/standards/dsssl/dssslo/dssslo.htm
[4] http://www.jclark.com/jade/
[5] http://sunsite.unc.edu/pub/sun-info/standa

--
※ Source: 哈工大紫丁香 WWW bbs.hit.edu.cn. [FROM: 203.123.8.6]