How XML improves the Library System

At the heart of every librarian is the desire to organize information so that it is accessible; a library without a cataloging system is just a pile of books.  No matter how eclectic or well preserved, a collection of books is only as useful as its cataloging system, and the same goes for the wide array of digital resources available on the internet.  When a code is provided to unlock the details of the collection, the information in that particular library is opened up to researchers.   A single code for the organization of available information is the ultimate quest for today’s librarians, and the code that works most effectively in our present ‘digital’ system is XML.

The challenge before librarians today is to take the seemingly unknowable expanse of electronic resources and catalogue it so that it is available to users searching from many different portals, in different languages and on different platforms. For this challenge, XML is the best option for the organization and delivery of information. Once translated and catalogued under XML standards, the bibliographic information of any scanned text or object is comprehensive, with more depth and cataloguing detail than ever before. This in turn affects library applications such as inter-library loans and the storing and accessing of digital libraries or archives.

XML improves library services primarily by improving bibliographic cataloguing.  This original translation of information into XML data paves the way for improvement in other applications in the library system.  Electronic resources such as online archives of texts and non-textual objects, resources in different languages and resources catalogued in different formats are more  accessible when they have been catalogued in the most interoperable ‘language’ that is  available. Once the original bibliographic data is in XML, inter-library loans are improved by increased availability of resources, and more scanned information can be stored and accessed in digital archives.

XML (eXtensible Markup Language) is a widely used standard on the internet: it is a structured, flexible and interoperable markup standard that is well suited to storing bibliographic information. Strictly speaking, XML is not a single language with a fixed vocabulary but a meta-language for defining markup languages, and in this way XML is more like MARC (MAchine-Readable Cataloging) than the other well-known language of the internet, HTML. While HTML describes how data is displayed and what it looks like, XML, like MARC, describes the content of data elements and structures the data so that it is portable between different systems. Both XML and MARC were designed to store and transport data, and they both define and validate information so that it can be shared.

In the arena of bibliographic cataloguing, XML has long been acknowledged as an improvement on traditional MARC, a system that began with index cards. MARC cataloguing was an effective system up until the advent of the digital age, but it has been challenged by the exponential growth of electronic information. The breadth of different languages and types of information that need to be catalogued has strained the old system, and although librarians have been inventive in their attempts to create bridges between different formats, they have inevitably committed every cataloguer’s dreaded sin, “redundant work”, and created a “de facto dual system” of information gathering and dispersal (Miller, 2000, Slide 3).

In an enlightening PowerPoint presentation in 2000, Dick R. Miller makes the argument that XML is the best schema for bibliographic cataloguing, the blood and guts of any interlibrary system. XML is a “meta-language, permitting the definition of an unlimited number of specific markup languages, each of which may contain an unlimited number of tags, hence extensible”, explains Miller, maintaining that the most significant aspect of XML is its separation of content, presentation and linking (Miller, 2000, Slide 4). Miller argues that XML is well suited for bibliographic data because of its ability to identify complex data structures. Unicode, the character encoding standard used by XML, is excellent for libraries as it allows diacritics, special characters, and non-Roman data to be handled like ordinary text.

Miller writes that the traditional MARC cataloguing format was flat and had limited support for hierarchy, “Whereas XML is inherently hierarchical” (Miller, 2000, Slide 24), allowing for more information to be stored and accessed. Miller explains that the XML schema is more flexible due to its hierarchical style, and because different components of information can be “divided and conquered” (Miller, 2000, Slide 11). Miller argues that XML is essential to the library system: without it, a library’s resources would be relegated to “dark data” and inter-library systems would be “under-utilized due to its segregation from mainstream web resources, and in danger of being marginalized” (Miller, 2000, Slide 4).
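Miller's points about hierarchy and Unicode can be made concrete with a small sketch. The Python below builds a nested bibliographic record with the standard library; the element names (record, title, creator and so on) and the sample values are hypothetical and are not taken from Miller's slides or from any cataloguing standard.

```python
import xml.etree.ElementTree as ET

def build_record():
    """Build a small nested record; all element names are illustrative only."""
    record = ET.Element("record")

    # A title can carry sub-parts instead of one flat string.
    title = ET.SubElement(record, "title")
    ET.SubElement(title, "main").text = "Война и мир"          # Cyrillic script
    ET.SubElement(title, "transliterated").text = "Voĭna i mir"  # diacritic

    # A creator groups a name, dates and a role under one parent element.
    creator = ET.SubElement(record, "creator", {"role": "author"})
    ET.SubElement(creator, "name").text = "Толстой, Лев"
    ET.SubElement(creator, "dates").text = "1828-1910"
    return record

record = build_record()

# Serialize to text and parse it back; the non-Roman text is ordinary data.
xml_text = ET.tostring(record, encoding="unicode")
reparsed = ET.fromstring(xml_text)

print(xml_text)
print(reparsed.find("creator/name").text)
```

Because each component sits in its own element, a program can address "the creator's dates" or "the transliterated title" directly, which is one reading of what Miller means by dividing and conquering the data.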

In his born-digital article entitled “Moving from MARC to XML”, K.T. Lam writes that the new XML language promises to make the “Web smarter by allowing Web pages to carry not just the layout, but the semantic structure of its content” (Lam, 1997, Part 1). Lam also emphasizes that bibliographic data is an area where XML would improve data collection and control; he says, “We can create bibliographic records once and publish them in different formats; bibliographic records can (will) be directly viewed by the Web browsers, search engines, and potentially library systems without the need of further conversion and bibliographic records be interchanged between XML and MARC without any data loss” (Lam, 1997, Part 1).
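Lam's idea of interchange without data loss can be illustrated with a toy round trip. The sketch below is a simplification under stated assumptions: the field list is hypothetical sample data, the datafield and subfield element names are modelled loosely on the MARCXML convention without its namespace, and a real MARC record contains a leader and control fields that are ignored here.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: (tag, two indicator characters, subfields).
FIELDS = [
    ("100", "1 ", [("a", "Doe, Jane")]),
    ("245", "10", [("a", "A sample title :"), ("b", "with a subtitle")]),
]

def fields_to_xml(fields):
    """Turn a flat MARC-like field list into a nested XML record."""
    record = ET.Element("record")
    for tag, indicators, subfields in fields:
        attributes = {"tag": tag, "ind1": indicators[0], "ind2": indicators[1]}
        datafield = ET.SubElement(record, "datafield", attributes)
        for code, value in subfields:
            ET.SubElement(datafield, "subfield", {"code": code}).text = value
    return record

def xml_to_fields(record):
    """Recover the flat field list from the XML record."""
    fields = []
    for datafield in record.findall("datafield"):
        indicators = datafield.get("ind1") + datafield.get("ind2")
        subfields = [(sf.get("code"), sf.text) for sf in datafield.findall("subfield")]
        fields.append((datafield.get("tag"), indicators, subfields))
    return fields

# Serialize, parse again, and compare with the starting data.
xml_text = ET.tostring(fields_to_xml(FIELDS), encoding="unicode")
round_tripped = xml_to_fields(ET.fromstring(xml_text))
print(round_tripped == FIELDS)
```

The comparison at the end only shows that this small sample survives the trip in both directions; it is not evidence about full MARC records.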

Once XML has improved the highways on which the information moves, the roads traveled can carry more information to and from more places.  Inter-library loans are an application that is improved by the use of XML: if one continues with the transportation metaphor, interlibrary loans can be seen as a well-run bus system that is dependent on the quality of the roads and the vehicles to move information from a resource to a portal.

Inter-library loans have been, and in some areas still are, a physically time-consuming task of transporting texts by mail. But the digital age has sped up the process. A scanned text can arrive on a user’s computer (or smart phone, PDA, etc.) in a matter of minutes, once the one-time labor of scanning the text has been done. With XML as the standard, librarians aim to have bibliographic details catalogued in a one-time procedure as well, with no translation or transferring between programs further down the road.

An inter-library loan system that can access the data banks of electronic resources programmed in different languages and in different codes is essential now that data can be searched and retrieved from so many sources. Once the information is retrieved, it is just as important that it be easy to download into different portals. Kyle Banerjee argues that the flexibility of XML improves the accessibility of information: “XML is particularly useful for presenting the same information for different users, since a style sheet can be used to format a Web-based news service for a businessperson with a wireless palmtop computer, a blind member of the computer with a talking computer, or a college student in a computer lab” (Banerjee, 2002, Hype vs. Reality).
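The principle Banerjee describes, one stored record and several presentations, can be sketched briefly. Python's standard library has no XSLT processor, so the two plain functions below merely stand in for style sheets; the record, its element names and the function names are all hypothetical.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical sample record; the content is stored once.
RECORD_XML = """<record>
  <title>A Sample Title</title>
  <creator>Doe, Jane</creator>
  <year>2001</year>
</record>"""

def render_html(record):
    """Presentation for a web browser: a short HTML fragment."""
    title = html.escape(record.findtext("title"))
    creator = html.escape(record.findtext("creator"))
    year = html.escape(record.findtext("year"))
    return f"<p><em>{title}</em> by {creator} ({year})</p>"

def render_speech(record):
    """Presentation for a screen reader: one plain sentence, no markup."""
    title = record.findtext("title")
    creator = record.findtext("creator")
    year = record.findtext("year")
    return f"{title}, by {creator}, published in {year}."

record = ET.fromstring(RECORD_XML)
print(render_html(record))
print(render_speech(record))
```

Neither function alters the record itself, which is the separation of content from presentation that the quoted passage relies on.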

Banerjee points out that XML has already been in use for some time: “For years, libraries have been quietly using XML to perform functions such as improving access to archival materials, simplifying interlibrary loans processing, and enhancing digital collections, but increased reliance on the internet for delivering information resources has brought XML into the mainstream, where its impact is starting to be felt by libraries of all sizes” (Banerjee, 2002, ‘Practical Applications’). Banerjee anticipates the need for XML in the organization of library data; he knows that libraries do not need to compete with the ‘internet’, only add their voice to the choir that is out there.

 

While the basic concept of organizing and cataloguing data so that it is easily accessed and delivered is unchanged, the actual form and style of libraries as we have known them is changing. Once upon a time a library was a building with books on shelves, and the librarian was the access and the authority. Invariably a ‘she’ in our collective memories, the librarian helped the student navigate the codes and translate the data. Today the experience of a research student is more like that of a lone adventurer without a compass in the vast frontier of the World Wide Web. Fortunately, the flexibility of XML allows programmers to make the search for digital data as simple as possible, as well as offer tools to translate and confirm the findings along the way.

One of the harsh realities of electronic resources is that they do not sit still like a book on a shelf. An electronic resource is like a moving target, the URL shifting from one ‘location’ to another and the question of authority shifting as well. An electronic article can have a limited shelf life under the proprietary powers of a limited copyright deal, and then move on, like a mercenary soldier, to the next owner. Recognizing the difficulty of hunting down stable URLs, JSTOR has implemented ‘deep linking’ to make electronic resources consistently available to its users: “As the scholarly community gravitates toward the use of the Web as its primary medium of communication, and as more and more scholarly resources are made available electronically, it is evident that students, faculty, and researchers will seek information through an almost unlimited number of avenues” (JSTOR NEWS, 2001, “Deep Linking”).

XML and related standards like DTD and XSLT have a hierarchical system of cataloguing that helps cataloguers divide the information into more distinct fields. A broadly recognized standard allows more access to electronic resources that have been catalogued in different forms and languages. XML also allows for more flexibility when cataloguing non-text objects, a more common occurrence with the rising popularity of digital archives. Posters, letters, photographs and maps are now included in library resources and must be clearly identified in the catalogue system. Building and accessing digital libraries that are flexible and accessible to all researchers is an essential role for libraries, not only for texts that may be deteriorating, but for non-text items that would not be available for study without cataloguing. Even websites, blogs and videos are part of the mix, and XML is the ‘go-to’ schema for incorporating all the multiple and various resources.

The ‘Legacy Tobacco Documents Library’ is an interesting example of the challenges faced by librarians who are creating and maintaining digital archives in XML. Heidi Schmidt, a librarian on the project, explains that the ‘Legacy Tobacco Documents Library’ began back in 1993 when the University of California was given an anonymous donation of documents from the files of Brown and Williamson Tobacco Company. There was some legal maneuvering in which the University of California refused to bow to the tobacco company, and ultimately the documents were scanned by the Library and Center for Knowledge Management and made available on the World Wide Web by 1995.

Schmidt describes how the original documents were kept on CD and “were created when no one expected the tobacco documents to grow to its present size and importance” (Schmidt, 2002). The scanned texts at the beginning of the project, for example, were of low quality and hard to read. However, the project grew in size and importance, with more documents being added and more users actively searching the documents over the years.

Cataloguing challenges were plentiful, writes Schmidt, explaining that “…the library was committed to maintaining the integrity of the original data for archival purposes. In creating the XML that is indexed by XPAT and used for searching, however, several small changes were made to enhance access to the documents through cross-collection searching. For example, users often search for names mentioned in documents, but some of the names in the collection records had X’s in front of the names, such as XXMARY. Since the search engines did not search for strings and only truncates the end of a word, users would never find these documents in a search. Rather than take out the X’s a program looked for all such occurrences and added the name without the extra characters. In another case, dates were normalized in the XML to allow efficient searching of the document sets” (Schmidt, 2002). The ‘Legacy Tobacco Documents Library’ is a prime example of a digital archive. It demonstrates the value of scanning and cataloguing important historical documents so that they can easily be accessed, and it shows how the flexibility of XML provides pathways and solutions to complicated cataloguing challenges.
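The two fixes Schmidt describes, adding a searchable copy of a name while leaving the original untouched and normalizing dates, might look something like the sketch below. This is not the project's actual program: the sample XML, the name_search element, the iso attribute, the prefix rule and the list of date formats are all assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical sample data in an invented layout.
SAMPLE = """<documents>
  <doc><name>XXMARY</name><date>03/15/1985</date></doc>
  <doc><name>JOHN</name><date>1985-04-02</date></doc>
</documents>"""

# Assumed rule: two or more leading X's followed by a capital letter.
# A real collection would need a more careful rule than this.
X_PREFIX = re.compile(r"^X{2,}(?=[A-Z])")

# Assumed input formats, tried in order.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")

def add_searchable_names(root):
    """Keep the original name; add a cleaned copy for the search index."""
    for doc in list(root.iter("doc")):
        original = doc.findtext("name") or ""
        cleaned = X_PREFIX.sub("", original)
        if cleaned != original:
            ET.SubElement(doc, "name_search").text = cleaned

def normalize_dates(root):
    """Record an ISO 8601 form of each date without changing the original text."""
    for date in root.iter("date"):
        text = (date.text or "").strip()
        for fmt in DATE_FORMATS:
            try:
                parsed = datetime.strptime(text, fmt)
            except ValueError:
                continue
            date.set("iso", parsed.date().isoformat())
            break

root = ET.fromstring(SAMPLE)
add_searchable_names(root)
normalize_dates(root)
print(ET.tostring(root, encoding="unicode"))
```

Storing the cleaned values alongside the originals, rather than overwriting them, reflects the commitment to archival integrity that Schmidt mentions.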

The missing element of these digital collections is the librarian with her glasses sliding down her nose and the answers to your questions. The evolving digital library is recognizing this missing element and has begun to add new features to its collections, such as ‘value added’ tools and applications that accompany the scanned documents and help researchers translate or decipher the information that they have uncovered. The Perseus Digital Library at Tufts University is a great example of the evolution of the digital archive. It contains not only scanned artifacts and texts in Greek, Latin, Arabic and German, but also programs for identifying the objects or translating the languages, exemplifying how a digital library can be more of an ‘interactive space’ than just a collection of texts and objects.

Gregory R. Crane, the Editor-in-Chief, writes in the introduction to the site that during the transition from a CD-ROM based collection to an internet site, the “nature and scope of Perseus demanded a flexible, extensible and powerful data management system. It was not enough for Perseus to simply offer the resources to the user, they wanted to facilitate the consumption of the information” (Crane, 2011, Background and Purpose). Crane explains that they needed a system that would allow for “an interoperable, modular and open-source digital archive to be created and used which included tools to enhance the consumption of the information such as automatic linking, information extraction and visualization services that existing, largely catalog-oriented systems could not support” (Crane, 2011, Background and Purpose).

Digital libraries are the libraries of the future, and XML and its attendant ‘languages’ are flexible and portable enough not only to offer cataloguing solutions but also to incorporate interactive and open-source applications. XML and XSLT are the most recent developments in a move to make information exchange systems interoperable and universal. The ultimate goal of every librarian, or ‘information technician’, is to create systems that allow for the most effective data entry and control, with the least amount of tinkering and the easiest and fastest access. Electronic resources are available from all over the world, and have been catalogued within many different systems. XML and XSLT allow the different systems to speak to each other and make the information more accessible to everyone. XML can create flexible and comprehensive data organization that allows for efficient cataloguing of the varied information of the digital age, as well as its delivery to different modalities.

Challenging cataloguing issues, such as different languages and scripts, non-text objects as diverse as slides and art collections, and electronic resources that have shifting, unstable URLs and unclear ‘authority’, could not be managed with the traditional MARC system and need the flexibility of the XML standard. Banerjee sums up the situation succinctly when he says, “The simplicity and flexibility of XML make it possible to integrate services and resources in ways that would have been impossible just a few years ago. Vendors, libraries, and open source programmers are all interested in finding ways to search many kinds of resources with a single query, and XML represents a major step forward in making this goal a reality” (Banerjee, 2002, Conclusion, XML’s Future in Libraries).

Miller also argues that XML is the way of the future, saying that it is essential for an integrated library system to have a flexible interface capability in order to be on par with comparable web resources. Miller argues that a library’s resources could be left unused and in the dark if libraries do not move on to XML-based library systems that will “put libraries in the Web mainstream, and foster, rather than impede, our ability to provide new and improved user services in this exciting environment” (Miller, 2000, Slide 45). The goal, according to Miller, is to allow data to “flow more freely in and out of library systems”, as well as to make librarians more relevant in the Information Age and allow “their eXpertise to be more broadly applied than currently” (Miller, 2000, Slide 45).

In her lively 2003 review of Roy Tennant’s book XML in Libraries, Priscilla Caplan is unsure of the inevitability of XML and describes the debates at the time as “‘MARC is dead’ alarmism and silliness all around” (Caplan, 2003). She questions the State of Tasmania’s statement that they needed to build an XML system from scratch rather than use the traditional or MARC-based systems in order to have “controlled and structured metadata (that was) simple to enter, easy to index, and flexible in terms of output and reuse possibilities” (Caplan, 2003). However, time has shown that a system that is “highly specific, yet consistent in terms of access points, vocabularies and subject terms” (Caplan, 2003) is needed for more than library systems.

Caplan’s view is that librarians, contrary to the ‘stereotypical view’, have always been keen to experiment with new technologies, even taking on new technologies before they have been proven. Caplan writes that XML, only five years old at the time, was increasingly becoming the way of the future. She lists the many ways that both governments and libraries had begun to use XML, uses that have only increased since, and ends with, “Truly, a thousand flowers are blooming” (Caplan, 2003).

Since the time of this book review, XML has become the new language of librarians, with MARC existing more like a revered ‘mother tongue’, its vocabulary redolent of a time now in the past. At the time of her article in 2003, Caplan was cautious about the success of a metadata system that still held ‘glitches’, but she was rightly proud of the lively, practical and uplifting articles about XML by her fellow librarians. In the quest for ever better organization of data, with universal access to knowledge as the ultimate aim, Caplan gives voice to the odd but joyful character of librarians everywhere, saying that librarians are “mindful of our ends, but we can certainly have fun with our means” (Caplan, 2003).

In a slightly more sober testament to the future of XML in libraries, Banerjee wrote in 2002 that XML was certain to increase in use within libraries, arguing that “The simplicity and flexibility of XML make it possible to integrate services and resources in ways that would have been impossible just a few years ago. Vendors, libraries and open source programmers are all interested in finding ways to search many kinds of resources with a single query, and XML represents a major step forward in making this goal a reality” (Banerjee, 2002, XML’s Future in Libraries). Banerjee’s only caveat is that XML is only a tool, and cannot bring information to the user on its own. The librarian’s job is to make the tool as effective as possible.

Although the structure of information has changed, and the way it is accessed has changed, the primary desire of every librarian, to make information accessible through an organized code, remains the same. The tool that we have at our fingertips is XML, together with related standards like DTD, XSLT and DOM, which allow information to flow more easily because of a standardized ‘grammar’ that builds hierarchically and separates content into distinct areas. The library system is improved by the use of XML because cataloguing is at the heart of any library, and XML improves the cataloguing process by allowing for more information to be catalogued, more data systems to be read, and more stored information to be accessed by different electronic portals.

 

 

 

References

 

Banerjee, K. (2002, September). How Does XML Help Libraries? Computers in Libraries, 22. Retrieved November 8, 2011 from http://www.infotoday.com/cilmag/sep02/Banerjee.htm

 

Caplan, P. (2003, May). XML in Libraries. D-Lib Magazine, 9. Retrieved November 10, 2011 from http://www.dlib.org/dlib/may03/05bookreview.html

 

Crane, G. (2011). Introduction to Perseus Digital Library. Retrieved November 19, 2011 from the Perseus Digital Library at Tufts University’s website: http://www.perseus.tufts.edu/hopper/opensource

 

JSTOR. (2001, June). JSTOR and ‘Deep Linking’. JSTOR News, 5. Retrieved November 20, 2011 from

 

Lam, K.T. (2010). Moving from MARC to XML. Hong Kong University of Science and Technology. Retrieved November 4, 2011 from http://ihome.ust.hk/~lblkt/xml/marc2xml.html

 

Miller, D. (2000). XML and MARC: a choice or replacement? PowerPoint presentation at the MARBI/CC:DA Joint Meeting, American Library Association, Chicago, 2000. Retrieved November 8, 2011 from http://elane.stanford.edu/laneauth/ALAChicago2000.html

 

Schmidt, H. (2002). Building Digital Tobacco Industry Document Libraries at the University of California, San Francisco Library/Center for Knowledge Management. D-Lib Magazine, 8. Retrieved November 9, 2011 from http://www.dlib.org/dlib/september02/schmidt/09schmidt.html