ICSTI CONFERENCE, LONDON, MAY 17, 2004

TECHNICAL AND ECONOMIC CHALLENGES OF SCIENTIFIC INFORMATION:
STM CONTENT ACCESS, LINKING AND ARCHIVING

Reported by Dr Wendy A Warr, Wendy Warr & Associates, 6 Berwick Court, Holmes Chapel, Cheshire CW4 7HZ, England. Tel/fax +44 (0)1477 533837, wendy@warr.com, http://www.warr.com

SESSION 1. OPEN ACCESS

Moderator's Comments

Sally Morris, CEO, Association of Learned and Professional Society Publishers, ALPSP

Open Access (OA) can be achieved in two ways: either authors' self-archiving of their individual papers in parallel with publication in traditional subscription-based journals, or the conversion of journals themselves to a free-to-access business model, where costs are covered by payment on behalf of the author rather than on behalf of the reader. This session focused primarily on the second approach.

What do we mean by “open”? This meeting was “open” but not free. One definition of OA is the Berlin-Bethesda-Budapest one. Another is delayed OA: partial OA at the publisher's or author's request. Stevan Harnad's definition is at one end of the OA spectrum. Who pays: the consumer, the library, the creator or a fairy godmother?

What is the problem? Is there a mismatch between funding and research output? Is OA intended to be a leveller of publishers' profits? Is it a question of access for all readers? How do we achieve social good? More effective use of research can also be a driver, and we need to ask “Who needs OA?”. ALPSP, HighWire Press and AAAS are carrying out research on all this.

Open Access: Sustainable Business Models and Ethical Imperatives

Hugh Look, Senior Consultant, Rightscom Ltd.

Is journal publishing a sustainable business? Look made some assumptions: almost all journals will be electronic; complex multimedia will be increasingly important; even under OA models there will be a role for a small number of paid-for premium titles; secondary publishing may also remain paid-for; and societies are not immune. Publishing is only one aspect of a wider spectrum of communications, and communications demand reciprocity.

Someone, somewhere will make a profit (or surplus); the ethical issue is how much profit is made and who makes it. Profits are used to pay dividends to shareholders (affecting your pension fund), to invest in new development, and maybe to advance the aims of a society. Should we be looking at value-added rather than profit and asking publishers to justify the value-added used for paying editors, staff, and profits? The profits of publicly-quoted publishers are needed to sustain shareholder value but societies need their surpluses too and so do owners of non-quoted publishers. Disruption could destroy the share price of commercial publishers but there is no proven commercial OA model yet and someone needs to provide capital for development. All publishers are vulnerable, not just quoted ones. There is also the issue of corporate social responsibility.

Look described Christensen's “disruptive innovation” model. The incumbents in a demanding market provide high-quality products for that market. New entrants provide a much less complete product cheaper or faster. The incumbents ignore the new entrants and focus on refining the current offering. The new entrants build a small base that generates enough cash to improve the product while the incumbents make incremental improvements to retain customers. Before long the new entrants have assembled enough capital and customers to be a real threat to the incumbents. Whether this model will impact on publishers depends on the speed of transition: 90% of low-profit journals would become non-viable very quickly if profits reduced elsewhere. There are a few reasons why commercial publishers might stay in the market even if margins became unattractive but some would sell businesses or close down altogether.

OA economic models are currently weak although they could become disruptive. Perhaps OA is just the catalyst and not the future. Possible outcomes are stasis, as publishers respond to changing demands; revolution, if there is rapid transition and OA publishers target profitable “B-list” journals; evolution (a period of co-existence); or chaos. Chaos can be managed in a simplified value chain where content and context are created, then co-ordinated, presented and delivered to produce an experience, not a product. Value chains do not work for OA: users are not built into the value process.

Content and context can be created at many points and value can be created and monetised at several points. The questions are where to extract financial value and how to relate to the stakeholders. Should you charge by value, bandwidth, time or satisfaction with the outcome (e.g., tenure or promotion)? Can users be sponsored rather than the content, if, for example, an IT supplier gives articles away free, or in the way that banks attract students? Think about reciprocity: what does the user get back out of the process? The price of disruption could be very high: entire industries do die.

Open Access - Who Pays the Piper?

Robert Campbell, President, Blackwell Publishing

A system funded by the author will tend to benefit the author; a system funded by the publisher will tend to benefit the publisher. A survey by Blackwell's has shown that authors choose a journal for status, readership and impact factor; speed of publication matters but the emphasis is on peer review. A survey by the Centre for Information Behaviour and the Evaluation of Research (CIBER) found that authors select a journal that targets their research colleagues and also has a reputation for quality and integrity from peer review. About 82% of respondents knew little or nothing about OA. Opinions about OA were positive but there were reservations over quality and longevity and a feeling that papers might become longer and more abundant as market power shifted from reader to author. There was a lack of understanding of what publishers do but 76% of respondents felt that they had better access to journals than 5 years ago. There was great resistance to author payment.

A journal achieves status and recognition through its perceived readership, peer review, respected editors, society association, and role in the subject community. It is the title that matters not the publisher. Pay-to-publish journals need to offer authors these features plus author-friendly services. Pleasing authors could be the downfall of such journals if there is rapid, open peer review, if the publisher offers low charges but cut-down services, and if it is perceived that well-funded research groups are better able to get work published in higher cost OA journals.

Authors will not expect to pay, whether the economic model is a subscription one or a pay-to-publish one. Funding councils and institutions may be prepared to pay. Authors might do more peer reviewing. Unfortunately there are issues of patronage (where a Head of Department has too much control) and of waivers for authors in the Third World, for example, who cannot afford even $200. If a journal changes to the pay-to-publish model it may lose or gain status but if it loses status, the subscribers it loses will never be recovered. This is a one-way experiment.

Taking a Leaf out of Houdini's Book

Jan Velterop, Publisher, BioMed Central

The paper was sub-titled “How to Escape from our Shackles”. In the old model of publishing, a manuscript was copyrighted, locked into a journal, and accessed by a key bought with money. In the new model an author pays to put a manuscript into a journal which is not locked, but open to all. OA is the logical consequence of the Internet and the possibility of universal access it offers. OA means unlimited access, re-distribution, and re-use. STM publishers (commercial as well as not-for-profit) must form an alliance with science, not stand in defiance of science. BioMed Central and other OA publishers are not driving OA but are responding to the needs of science.

There are those who say that the old model need not be fixed since it is not broken; it has evolved over centuries, but so has the horse-drawn carriage and you do not see them too often on the streets any more. The science of old is different from science now. It is borne by the Internet not carrier pigeons; the individualist model has been replaced by a collective, collaborative one and research is data-intensive. The old model is broken.

Seeing the need for OA reduces the problem to finding economic and practical solutions. The shackles are the myths about OA, such as:

  1. The cost of providing OA will reduce the availability of funding for research.
  2. Access is not a problem; virtually all UK researchers have the access they need.
  3. The public can get any article they want from the public library via interlibrary loan.
  4. Patients would be confused if they were to have free access to the peer-reviewed medical literature on the Web.
  5. It is not fair that industry will benefit from OA.
  6. OA threatens scientific integrity due to a conflict of interest resulting from charging authors.
  7. Poor countries already have free access to the biomedical literature.
  8. Traditionally published content is more accessible than Open Access content as it is available in printed form.
  9. A high quality journal such as Nature would need to charge authors £10,000-£30,000 in order to move to an OA model.
  10. Publishers need to make huge profits in order to fund innovation.
  11. Publishers need to take copyright to protect the integrity of scientific articles.
  12. The archive is not secure with OA publishing.

Velterop singled out myth number 6 for special attention. This is often interpreted as “Open Access publishers are likely to accept more papers, because they get more money that way” but it could equally be interpreted as “publishers are likely to accept more papers, because they get more money that way”. In a message to serialst@list.uvm.edu on March 10, 2004, Sally Morris said that publishers would consider their 2005 prices bearing in mind not just cost increases but also the additional pages published if the submission rate of high-quality papers goes up.

Actually, the myths represent progress because just a few years ago, open access was completely dismissed as ridiculous, hare-brained, and cranky but, now, the conventional publishers feel the urgent need to discredit open access. They call the OA public relations message “horribly good”. This is nice to hear, but it is hardly true. It is the audience that is so receptive to the message that makes the PR so effective.

OA entails reducing dogma to pragma. Richard Horton, Editor of The Lancet, has said: “The long-term goal for any editor of a primary research medical journal is to strengthen the culture of scientific inquiry and to improve human health. These are the ultimate yardsticks by which readers, authors, funding agencies, librarians, and publishers should judge the success of journals. In the sometimes divisive debate about open access, let us not lose sight of the fact that the publishing model is simply a means to a much greater end, an end that has far too long been neglected.”

“Strengthen the culture of scientific inquiry” is dogma, while “the publishing model is simply a means to a much greater end” is pragma. Velterop raised a few practical questions. Who pays, and from which budget, and how can we make the transition? The institute pays, from the research budget. This leaves the question of payment mechanisms in the transition.

A typical reaction to the concept of institutions paying article processing charges on behalf of their researchers is “Our library budget cannot deal with the open-endedness of paying article charges”. Would the pivotal role of coordinating money streams be the logical one for libraries and librarians? Should the fact that there are, as yet, no payment coordination mechanisms impede the development of OA, or should OA and its benefits be an incentive to devise those mechanisms?

Suppose that a journal costs €100,000 to run and it carries 100 articles at a cost of €1000 each. It can sell 1000 subscriptions at €100 each to individuals who do not share; or it can sell 500 subscriptions at €200 each to libraries which share; or it can sell 100 subscriptions at €1000 each to a consortium for sharing; or it can sell one subscription for €100,000 for the world to share; or it can sell 100 articles paid for at €1000 each, on open access. Velterop showed an image of a volcano. Pressure builds up inside it, the magma shrinks and leaks, and a cavity is formed. A cavity is dangerous: the volcano erupts and a crater is left at the top.
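As an aside (not part of Velterop's talk), the arithmetic of these alternatives can be restated in a short sketch; the figures are the ones he quoted, and the model labels are merely illustrative.

    # Velterop's illustrative journal: EUR 100,000 to run, 100 articles a year.
    TOTAL_COST = 100_000
    ARTICLES = 100

    # Each model recovers the same total cost; only who pays, and how widely
    # the content may be shared, changes.
    models = {
        "1000 individual subscriptions (no sharing)": (1000, 100),
        "500 library subscriptions (shared)": (500, 200),
        "100 consortium subscriptions (shared)": (100, 1000),
        "1 subscription for the whole world": (1, 100_000),
        "100 open access articles": (100, 1000),
    }

    for name, (units, price) in models.items():
        revenue = units * price
        assert revenue == TOTAL_COST
        print(f"{name}: {units} x EUR {price} = EUR {revenue}")

    print(f"Cost per article in every case: EUR {TOTAL_COST // ARTICLES}")

Whatever the model, the €100,000 has to come from somewhere; open access simply attaches it to the article rather than to the reader.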

Open Access - One More Challenge, One More Opportunity

Mark Furneaux, Managing Director European Operations, CSA Europe

Cambridge Scientific Abstracts is a privately owned, 35-year-old company publishing bibliographic databases and abstracts journals. Its selection criteria are the subject scope of the database and the quality of articles. In choosing to abstract a serial it looks at the quality of the serial, availability of the source, capability of translation, end user requests, publisher relations, and cost-effectiveness of inclusion. The objective is to facilitate the identification of the original article and access to it. The article usually requires purchase.

Does OA change anything? The Budapest Open Access initiative definition states: “By ‘open access' we mean its free availability on the public Internet, permitting any users to read, download, copy, distribute, print, search, or to link to the full text of these articles, crawl them for indexing, pass them as data to software, or to use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the Internet itself”.

There is currently an inquiry by the UK Science and Technology Committee into Scientific Publications. More than 20 submissions can be read at http://www.biomedcentral.com/openaccess/inquiry/. The CILIP submission says that there should be “no discrimination by the Research Assessment Exercise for or against open-access journals. The deciding factor must be quality of the journal and the quality of the article”. Furneaux emphasised the last sentence. The deciding issue for CSA is quality. OA journals are another source, more important in some disciplines than in others. Numerous new OA journals require evaluation and perhaps inclusion, which has a considerable impact on resources.

Some OA journals are peer reviewed, but will a low rejection rate mean low quality? Scientists want recognition and the best of them will publish in peer reviewed serials. However, OA serials are widely read and may be a great opportunity for grey literature. Challenges of OA are a sustainable economic model, permanent archiving, broken links and the need to update links, identifying the final version, and quality. Benefits of OA are wider readership, the availability of full text, a decreased cost of acquisition for A&I services, and the fact that the articles are already in electronic format.

Secondary publishers are “below primary publishers in the food chain” and uncertainties with serials have an impact on them. OA also has an impact on society publishers who rely on publishing to fund other activities. Fragmentation of content makes the role of secondary publishers more difficult but more important: by Web indexing of content these publishers give value-added retrieval of select, relevant, high-quality material from a mountain of literature. Abstracting and indexing (A&I) organisations can play a quality control role in a world where there is more information, but people have less time to read it, and quality may be declining or more variable.

CSA was an early CD innovator and was very early in Web registration (http://www.csa.com). In keeping with this pioneering spirit, CSA will be launching its own OA journal entitled Sustainability Science: Practice and Policy as part of a project with the US National Biological Information Infrastructure. OA is widening readership but creating publishing uncertainties. Quality is the major issue, so A&I services have a key role in selecting and indexing relevant quality material. OA does open up new opportunities to disseminate some sources.

The Technology Behind Redistributing the Cost of Online Publishing

Geoffrey Bilder, Chief Technology Officer, Ingenta Ltd.

Ingenta has expertise in aggregation, metadata enhancement, bespoke Web site development, Web site management, hosting and maintenance, and access control. In the transition to OA, Bilder does not see much change in terms of access control, the presentation layer, reference resolution, statistics, the content store, content management, and metadata management and distribution. The technology that allows you to distribute the cost of an online publication fairly is the same whether the journal is subscription-based or OA.

Bilder talked about the concepts of trust, both horizontal and vertical, and local and global. Scholarly trust is vertical and global; Internet trust is local and horizontal. The publisher problem is that the value proposition is being questioned and there are accusations of profiteering. Content is comparatively hidden, and brand is increasingly hidden. Trust in intermediaries is depreciating. Spam, viruses, and bad metadata contribute to user distrust in the Internet. Yet eBay, Amazon, Slashdot and Google are success stories. The success is largely attributable to early adoption of simple trust metrics. These are based on stealth metadata and restricted to their own sites, without taking context into account. A good trust metric must preserve privacy, be convenient, be attack-proof, be distributed, be self-policing, include robust support for context, and work across services.

Is Open Access the Solution?

Charles Oppenheim, Professor of Information Science, Loughborough University

Many things are driving the OA movement. Academics and librarians are complaining about the current toll-access system. Governments around the world think that publicly funded research should lead to free of charge access. The Wellcome Trust has taken an initiative. The UK House of Commons Select Committee is looking at OA.

So, what is Open Access? OA makes electronic copies available free of charge to anyone who can read them, by two routes: the open access journal (toll-free, i.e., without subscription charge) and the electronic repository (subject-based or institutional). Stevan Harnad is a well-known proponent of repositories. Both journals and repositories are searchable from remote locations.

OA has lots of advantages. Free exchange means a return to the core value of scholarship. The number of accesses for OA e-journals is higher than for toll-access e-journals and there is some evidence that OA articles get more citations than those in toll access journals. A moral and ethical argument is that everyone around the world can get access and there is also an impact argument: more eyeballs means greater spread of ideas. OA also cuts down costs for libraries.

However, OA also has disadvantages. Scholars as authors have concerns about peer review, cost, prestige, archiving, and information overload. There are copyright issues. Not everyone has access to the Web, especially in developing countries, and OA merely shifts the costs from libraries to the funding agencies or employers.

Authors can refuse to sign a copyright assignment, avoid publishers who require assignment (see http://www.sherpa.ac.uk for a list of publishers who offer OA-friendly licences), use an OA journal instead of a traditional one, or use the “Oppenheim-Harnad solution” described at http://www.cogsci.soton.ac.uk/~harnad/Tp/resolution.htm#Harnad/Oppenheim. This “preprint-plus-corrigenda” strategy involves the following steps. Self-archive the pre-refereeing preprint. Submit the preprint for refereeing (revise, etc.). At acceptance, try to amend the copyright transfer agreement to allow self-archiving. If this is successful, self-archive the refereed postprint; if it is not, archive the “corrigenda”. This solution is unethical, even if it is not illegal. However, all these suggestions require some degree of self-confidence, which is no problem if the academic has a high reputation but is a problem if he is just starting out.

Not everyone has access to the Web. The idea of an increasing number of eyeballs falls down when so much of the world's population has no access to telephones, networks, PCs, and reliable power supplies. Even where there is access, the costs of the hardware and software may be considered too high. Is there any evidence of considerable un-met demand for electronic access to research output?

Shifting the costs is a zero sum game: someone has to pay for the system one way or another, so how does it get charged for, and who has to pay? In the author-pays model for OA journals, the costs per article to break even range from $500 to $2000; to make a profit, OA publishers need even more, maybe $5000. (Note that there is no reason why OA cannot be profit-making.) Does the publisher charge submission fees or publication fees? In the case of a submission fee, if the article gets rejected, the author has wasted his money. A publication fee is levied only if the article is accepted, but then successful authors are subsidising poor quality authors. There is no easy answer to the dilemma, but most journals go for a publication (acceptance) fee; a few go for a mixture of the two.
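As an illustration (not from Oppenheim's slides), the difference between the two fee models can be made concrete with hypothetical numbers: a fixed cost to be recovered per published article and an assumed acceptance rate.

    # Hypothetical figures, for illustration only.
    COST_PER_PUBLISHED_ARTICLE = 1500.0   # dollars to recover per accepted paper
    ACCEPTANCE_RATE = 0.4                 # i.e., 60% of submissions are rejected

    # Pure publication fee: only accepted authors pay, so each pays the full cost.
    publication_fee = COST_PER_PUBLISHED_ARTICLE

    # Pure submission fee: every submission pays, accepted or not, so the fee
    # per manuscript can be lower in proportion to the acceptance rate.
    submission_fee = COST_PER_PUBLISHED_ARTICLE * ACCEPTANCE_RATE

    print(f"Publication fee per accepted paper: ${publication_fee:.0f}")
    print(f"Submission fee per manuscript:      ${submission_fee:.0f}")
    # A rejected author still loses the submission fee, while under a publication
    # fee the successful authors effectively cover the handling of rejected papers.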

Who is going to pay for OA journals? If the funding agency does, the cost can be incorporated into the bid but then there may be fewer funding awards overall. If the employer (e.g., the university) pays, what gets cut to pay for the OA costs? If the library pays, it is no better off than before. In rare cases the author himself may pay. Fees are often waived if an author pleads poverty, but then richer authors are subsidising poorer ones. Some subject areas, such as humanities, are not funded much by funding agencies and may not get OA journals, though there is no reason why institutional repositories should not be established in the humanities. It is not clear who will pay for institutional repositories: the library, computer services, central administration, or individual departments. The costs are uncertain at the moment but they are certainly not zero. What gets cut to pay for this?

Oppenheim is in favour of OA, but the ethical and commercial issues are still unclear. He believes that OA has been over-sold as a panacea to the serials crisis.

SESSION 2. EASILY ACCESSIBLE CONTENT AND LINKING

Moderators' Comments

Barry Mahon, Executive Director, ICSTI

With more material being published on more sites, identifying and linking materials becomes more difficult. There is a great future for A&I services in this new world, but how can they be sure that they are identifying where the material has been published and, having found it, that it is the latest version? This session deals with these issues.

Links Add Value to Research Publications

Tim Ingoldsby, Director of Business Development, American Institute of Physics (AIP)

Ingoldsby gave a brief history of linking at AIP:

1995 Links to bibliographic databases provided an abstract describing the cited article

1996 Links to source articles but only to articles from the same journal

1997 Links to source article from other journals but only if both journals were on the same online service

1998 Links to other databases (LANL preprint server, MEDLINE, etc.)

1999 Links to/from journals of other publishers located on other servers (using link resolvers)

1999 Links to/from value-added resources (ISI Web of Science, Chemical Abstracts' ChemPort)

2000 CrossRef central linking facility (using DOIs) is established

Scitation (http://www.scitation.org) is a hosting service for more than 100 journals from AIP (http://www.aip.org) and other science and engineering societies. Ingoldsby displayed a screen illustrating how links look today: the abstract from an article in the Journal of Chemical Physics was followed by links to additional information and full text, and a display of references from Web of Science, with links to the full text of these articles. He also gave bar charts breaking down the inbound links from services such as ISI, Chemical Abstracts, UI-UC/DLI, and National Laboratories, to all Scitation journals over recent months (about 110,000 of them in March 2004) and breaking down source article inbound links (from OJPS journals, APS Link Manager, CrossRef, etc.) to all Scitation journals (about 300,000 a month). About 34% of Scitation abstract or full text views come as the result of a link from “the outside”; 69% of outbound links go to another Scitation destination, 13.7% to CrossRef publishers and 10.8% to secondary database services.

Article links continue to add value. In 1996, the average article in Applied Physics Letters had 15 bibliographic links, 1 full text link and no citing article links. Those same articles in 1999 (before the implementation of CrossRef and back file additions) had 15 bibliographic links, 2 full text links, and 3 citing article links. The same articles today have 36 bibliographic links, 15 full text links, and 6 citing article links.

Ingoldsby next showed five screens from AIP's PhysicsFinder (http://www.physicsfinder.org/), a product for searching topics, authors, and abstracts published in AIP and MAIK journals. First he demonstrated a fairly specific search for nanophotonic switching. (Google search was implemented in May 2003.) Clicking on the first hit gave the abstract (with link to full text) and references for a specific article in Applied Physics Letters. There are options to search for more articles from that issue of the journal, to exit to Scitation, to find more articles by the same authors or to search the Physics and Astronomy Classification Scheme (PACS).

PhysicsFinder is a success in terms of usage, revenues, and technology. There were nearly 1,320,000 page views in March, 56% of them for AIP journals on Scitation. Article sales have increased 400%; the monthly sales revenue equates to five new institutional subscriptions. PhysicsFinder features are being added to Scitation.

The Role of Bibliographic Databases in the Resource Discovery and Linking Process

Andrea Powell, Product Development Director, CABI Publishing

A&I databases have a comprehensive index of all relevant material in a given subject area, with consistent indexing using controlled vocabulary and/or classification schemes, and standardisation of formats and terminology. They cover all document types, are easily accessible through many different platforms, and have enough information to link to the full-text. Powell produced statistics to show that the use of A&I databases, despite many predictions, is not going away. Los Alamos researchers use A&I databases 60% of the time to link to full-text. In a JSTOR survey, 37% of respondents start their search in an online database. ALPSP surveys of academic authors consistently rank inclusion of journals in A&I databases as “extremely important” when choosing where to publish research.

For linking to full-text online, CABI (http://www.cabi-publishing.org) captures DOIs wherever possible. OpenURL linking is used where a DOI has not been captured or the content does not have a DOI, or for integration with link resolvers. Document URLs are included. CABI has bilateral arrangements with primary content providers (especially for linking via hosts) and other bespoke services to link to other content types such as patents and GenBank. There are links to the Web search engines Google and Scirus to expand resource discovery, and links to document delivery providers for paper-based supply. OAI-compliant metadata is harvested to include new types of Web-based content.
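As a rough sketch of the two linking routes described here (not CABI's actual implementation), a record with a captured DOI can be turned into a DOI proxy link, while a record without one can be expressed as an OpenURL query for a link resolver; the resolver address and field names below are hypothetical.

    from urllib.parse import urlencode

    # Hypothetical addresses; a real deployment would use the customer's own
    # link resolver and the standard DOI proxy.
    DOI_PROXY = "https://doi.org/"
    LINK_RESOLVER = "http://resolver.example.edu/openurl"   # hypothetical

    def full_text_link(record):
        """Prefer a DOI link; otherwise build an OpenURL-style query."""
        if record.get("doi"):
            return DOI_PROXY + record["doi"]
        params = {
            "genre": "article",
            "issn": record.get("issn", ""),
            "volume": record.get("volume", ""),
            "spage": record.get("spage", ""),
            "atitle": record.get("atitle", ""),
        }
        return LINK_RESOLVER + "?" + urlencode(params)

    # Illustrative records, not real CAB Abstracts data.
    print(full_text_link({"doi": "10.1234/example.doi"}))
    print(full_text_link({"issn": "0001-0000", "volume": "12",
                          "spage": "345", "atitle": "An example article"}))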

This is all very well in theory but in practice there are obstacles in the way of seamless linkage between secondary and primary content. DOIs can be found for only about 25% of CAB Abstracts records (this figure should be about 40% according to predictions from retrospective harvesting), not all of CABI's hosts activate DOIs, not everyone has a link resolver, URLs are not always permanent, and lots of content is not online. CABI also has to ask itself the following questions. Which version of the full-text do we point to? How do we find out about new content? Some users complain about lack of links; does this affect our selection policy?

Linking to A&I databases was pioneered by PubMed, and has now been adopted by many secondary services such as CSA and CABI. A link to an abstract can fill a gap when there is no full text available but it can be confusing or irritating if the text is available. Linking is much harder to do if the database is not a free resource. Developments in the OpenURL make inward linking much more feasible.

A&I services of the future will continue to fill an important role for resource discovery as content proliferates but they need to adapt to new methods of scholarly communication, be involved in industry developments, be easy to use and be compatible with many end-user platforms. Future services will focus more on analysis and evaluation than simply recording the published research.

CrossRef: Virtual Integration for Scholarly Content

Ed Pentz, Executive Director, CrossRef

Pentz emphasised the need always to keep the end user in mind, minimising the time and effort needed to satisfy specific needs and requirements. Access to content is a matter of linking and searching. The challenge is to make connections in a distributed environment. The user wants to search and link regardless of publisher or location of content whereas the publisher wants visibility and traffic.

CrossRef is an independent membership association of scholarly publishers, operating a cross-publisher citation linking network using the DOI. It is one of nine official DOI registration agencies world-wide. Its mission is to provide services that bring the scholar to authoritative primary content, focusing on services that are best achieved through collective agreement by publishers. It has 307 participating publishers, 290 libraries and consortia, and 33 agents and affiliates (e.g., secondary publishers). It links 11.1 million items from 9,500 journals, carrying out 6 million DOI resolutions a month. More than 2.5 million DOIs are retrieved, and about 300,000 records updated, every month. There are DOIs for 650,000 books and proceedings. Backfiles are being digitised and assigned DOIs, 2.4 million of them in 2003 and over 2 million expected in 2004. The oldest DOI is related to Volume 1, Issue 1, of The Lancet, dated 1823.
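For readers unfamiliar with the mechanics, a DOI resolution of the kind counted above is essentially an HTTP redirect from the DOI proxy to the publisher's current URL for the item. The sketch below (not a CrossRef tool) shows the idea; the DOI shown is a placeholder, not a real identifier.

    import urllib.request
    from urllib.error import HTTPError

    def resolve_doi(doi):
        """Follow the DOI proxy's redirect chain and return the final publisher URL."""
        try:
            # doi.org (historically dx.doi.org) is the public DOI proxy.
            with urllib.request.urlopen("https://doi.org/" + doi) as response:
                # urllib follows redirects automatically; .url is the final location.
                return response.url
        except HTTPError:
            return None   # unknown DOI, or the publisher site refused the request

    # Replace the placeholder with a real DOI to see an actual resolution.
    print(resolve_doi("10.1234/placeholder"))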

CrossRef's article network via reference links has reached critical mass. It will be implementing “forward linking” (providing “cited-by” links for articles) later in 2004 and is experimenting with a cross-publisher full text search, CrossRef Search. CrossRef Search is powered by Google. Nine publishers are involved in the pilot service of cross-disciplinary, full text search of journals and conference proceedings. A standard Google search is carried out, with results limited to authoritative scholarly content. The content is also available in regular Google searches. Publishers have CrossRef Search boxes on their normal search pages. The pilot is to run until the end of 2004 for evaluation of functionality, ranking, and end user feedback. DOIs are used for indexing articles and linking from search results back to the publisher. The service is optional for CrossRef members and participation in the pilot is to be expanded in 2004.

Pentz showed eight screens illustrating the capabilities of CrossRef Search. The first was the search box at nature.com. He then showed the results of a Google search for “morpholino” at nature.com (from blackwell-synergy.com and nature.com) and the results from the same search on native Google (including hits from Stanford, PubMed and others). He showed a Nature full text display, including its DOI, and he pointed out the DOIs in the references section. A DOI can be entered as an article locator in PLoS Biology or as a search term in Google.

Digital Identification - the New ISBN

Michael Carter, on behalf of Shane O'Neill, Managing Director, Parliaments, Assemblies & Official Publishing Division, TSO (The Stationery Office)

The explosion of Web activity is far exceeding the capacity of government infrastructure to manage the publication and life cycle of official information. What we are seeing is a gradual replication of appropriate publishing disciplines within the context of a multiple media world. The dissemination of official information across networks requires the adoption of an agreed system of persistent identification similar to the adoption of the ISBN system which revolutionised the e-trading of books nearly 40 years ago.

Digital identifiers, encompassing ISBNs for print and Web manifestations, will secure reliable linking, build traffic over time, provide tracking throughout the life cycle (from copyright deposit, through persistence during the item's lifetime, to archiving), and provide assurance and reliability. The DOI provides assurance and reliability in versioning. Embedded in the user's desktop application and Web output, it is a permanent reference which does not change but always references the latest version.

The Government sector has made progress in the adoption of DOIs. DOI adoption is part of the policy of the Office of the e-Envoy, the Office for Official Publications of the European Communities (OPOCE) will become a DOI agency, and the Scottish Government has mandated DOIs. TSO, the publisher of United Kingdom Official Publications, is the first DOI Registration Agency in Europe. It has an established portal (http://www.tsoid.com) and has commercial customers such as Photo Libra, which has bought 45,000 DOIs.

Full Text Linking at Loughborough University Library - a User Perspective

Chris Bigger, Academic Services Manager (Engineering), Loughborough University

Over the last three years, Loughborough University Library has introduced Aleph, Metalib and SFX. All three help to provide users with a seamless access route to physical library stock and electronic resources. Aleph provides an OPAC (Online Public Access Catalogue), Metalib provides a subject access route to databases and cross-searching of some databases, and SFX deals with linking to full text sources.

Today, about 180 databases and about 6500 e-journals are in the systems and the library wishes to make as much of this as possible electronic, with seamless links. This fits with user demands and Web culture but the danger is that if print is ignored, users could just scratch the surface and do poor research.

The library OPAC (from Ex Libris) gives details for all library stock and also links to e-journals. Metalib (also from Ex Libris) provides a one stop shop for all Loughborough's databases, Internet resources and CD-ROMs. Resources are arranged by Department. Cross searching of some databases is possible (although some publishers do not allow cross-search). There are links to all remote databases. Users can customise their own lists and e-journal listings will be handled soon. SFX gives links to the library catalogue and full text, from Metalib and many other databases. It helps with the catalogue search but is not an indication that the item is held in the library.

All databases can be accessed directly from within Metalib. Direct access is used for those that are not cross-searchable and for access to advanced features. SFX and other full text links work in some databases. Bigger showed how the bibliographic results output from a search of CSA had SFX icons for clicking to full text. He then showed related screens from Engineering Village 2 and Ebsco where the appearance of the links to full text is quite different.

This raises the issue of information skills training. All students, staff and researchers have access to the full collection of databases and e-journals, on or off campus. The library needs to show users how to think about their search, put a strategy together, carry out the search (on different interfaces), review results, re-run the search, set up CA, use PBS, etc., and there is limited time.

A variety of routes is possible, including formal demonstrations, lectures, hands-on training, information desks for enquiries, and one-to-one sessions. Excellent progress has been made and life is getting easier all the time as many more links to full text appear every day, but there is a need to keep pushing, as users demand more, and simpler, linking. Users sometimes do not understand why there are multiple databases which give access to a variety of journals. There are many different players. Systems are often not seamless, necessitating multiple searches. The library has to sort this out, or users will go away or carry out poor searches. With many suppliers promoting their interfaces and linking routes, the user organisation can be invisible to the user. With any linking technology, is there a chance that the user organisation can get its brand or image visible too?

Bigger appealed to the vendors for simple routes to full text, with consistent terminology and the abolition of jargon, avoiding the need to explain the mechanics of multiple full text systems and giving users more precious time to concentrate on the search strategy, the search interface, advanced searching, and so on.

SESSION 3. ARCHIVING, CONTENT PRESERVATION, AND LONG TERM ACCESS

Moderator's Comments

Bernard Dumouchel, Director General, CISTI, National Research Council Canada

The STM content community is involved in a number of projects, initiatives and activities that demonstrate the necessity, feasibility and sustainability of content preservation and long term access. This session highlights some of these initiatives that have created new opportunities through digitisation of past content by content owners, and other strategies in libraries.

The Past is a Different Database - They Do Things Differently There

Jeff Pache, Inspec Electronic Product and Service Development Manager, The IEE

The title of the session implies future-proofing access to today's scientific record but Pache mainly discussed “present-proofing” access to yesterday's scientific record by looking at Inspec's archival backfile project. This project set out to produce an electronic version of the printed Science Abstracts journals from 1898 to 1968 as a backfile to the Inspec database, which was established in 1969. The backfile data covers 71 years in 176 volumes, 135,000 pages, 873,700 abstracts, and 2,100,000 index entries, producing, via 25 GB of PDFs, one database of 1.5 GB of XML plus 3,675 GIFs.

Cultural differences of the past include a different view of the literature and a different view of people (consider, for example, “Mme P. Curie”). The medium was unstructured print whereas, today, structured XML is used. Manual handling caused errors and gaps that computer validation would have prevented.

Pache outlined some of the problems Inspec encountered. Obtaining a copy of the raw material that could be destroyed in the conversion process was not always easy. No amount of sampling of the raw material could find all the anomalies. There were curiosities such as handwritten tables, and records where two articles were very similar so the abstractor had handled them as one. There were anachronisms in the treatment of authors, bibliographic control was of a lower and variable standard, and there were problems such as items saying “see previous abstract” which were meaningful in a print version but not in an electronic one.

Sources of errors were of more than one type. The original printed data had typographical errors (e.g., “Fench”, “Ferch”, “Russsian”) and omitted data. Misreading, misinterpreting and mis-keying in the data capture process led to further errors. Automatic correction to deal with both these problems can introduce further errors.
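The kind of validation that can catch such errors without correcting them blindly can be sketched with the standard library: suspect values are fuzzily matched against a known list and reported for human review rather than silently “fixed”. The word list and threshold below are illustrative and are not Inspec's actual process.

    import difflib

    KNOWN_LANGUAGES = {"English", "French", "German", "Russian", "Japanese"}

    def flag_suspect(value):
        """Return a suggested correction for review, or None if the value looks fine."""
        if value in KNOWN_LANGUAGES:
            return None
        matches = difflib.get_close_matches(value, KNOWN_LANGUAGES, n=1, cutoff=0.7)
        return matches[0] if matches else None

    for raw in ["Fench", "Ferch", "Russsian", "German"]:
        suggestion = flag_suspect(raw)
        if suggestion:
            print(f"'{raw}' looks like a keying error; did the original say '{suggestion}'?")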

There was variety in the styles and formats of classification and indexing: zero, one, two, or three levels of subject headings, and class codes, or no codes and UDC. Archaic terminology was encountered, e.g., “cosmogamy” has become “cosmology”. Automatic application of modern terms might have applied terms before their time so it was not implemented. Solutions were to map the original indexing and classification to terms and codes from the current Inspec thesaurus and classification, and to create additional “non-current” terms for extinct technology and theories.
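A minimal sketch of that mapping approach (illustrative only, not the real Inspec tables) might look like this:

    # Illustrative mapping tables, not the real Inspec data.
    ARCHAIC_TO_CURRENT = {
        "cosmogamy": "cosmology",          # archaic term -> current thesaurus term
    }
    NON_CURRENT_TERMS = {
        "coherers",                        # hypothetical example of extinct technology
    }

    def modernise(term):
        """Map an archaic index term to current vocabulary, preserving extinct concepts."""
        term = term.lower()
        if term in ARCHAIC_TO_CURRENT:
            return ARCHAIC_TO_CURRENT[term]
        if term in NON_CURRENT_TERMS:
            return term + " (non-current term)"
        return term                        # already acceptable as-is

    for old in ["cosmogamy", "coherers", "electrons"]:
        print(f"{old} -> {modernise(old)}")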

Backfiles of significant age have a very different flavour from current raw data. Understanding the data and the assumptions implicit in them is crucial. Capturing the data in a computer file is but a small part of the process. The more the analysis, the better the final product.

An Archive of Physics: 130 Years of Scientific Research Online

Tony O'Rourke, Assistant Director Journals, Institute of Physics Publishing Ltd. (IOP)

IOP has digitised its entire journal archive (http://www.iop.org/EJ/) for 1874 to 2003. The London Physical Society published its first journal, Proceedings of the London Physical Society, in 1874. That journal (1874-1968) has become the Journal of Physics series. Also published were the Journal of Scientific Instruments (1923-1967), Reports on Progress in Physics (1934- ), British Journal of Applied Physics (1950-1967), and others from the 1960s onwards.

There is a demand for historical content: APS (1894-), ACS (1879-), RSC (1841-), and Elsevier (1823-) have produced backfiles. Inspec (1968-) with the AXIOM platform has also seen the demand for older content. The IOP vision is to link seamlessly to older content in references, regardless of age. Digitisation of the IOP backfile supports IOP's prime objective and Royal Charter and provides a service to researchers, members, and libraries.

The IOP Journals Archive contains all peer-reviewed content published by IOP and IOP partners. It is the IOP's most ambitious publishing project ever, covering 800,000 pages and 100,000 articles for 1874-1991. The project, which took 1 year, was completed in December 2002. IOP already had electronic data for 1991-2002, the 1996-2002 part through electronic publishing and 1991-1995 data through previous digitisation.

IOP first created a specification. It found big variations in publishing practices and standards, for example, some papers had no authors, no references, or four titles. Apex Data Services was chosen as supplier. Hardcopy was located: stock was used for post-1968 journals and IOP library hardbound journals for pre-1968. Some copies were borrowed from around the world. It took two staff six weeks to prepare the stock.

Keying to XML was carried out: XML metadata, XML header and XML reference list. (Some publishers have chosen not to capture references but IOP did.) Simplicity of data structure was the aim. Full text was captured as “searchable image” PDF (image with OCR text). The specification balances quality with file size. Scanning was carried out from “disbound” volumes in order to achieve the highest quality. The data was then loaded and checked.
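To make “XML header plus searchable-image PDF” concrete, the following sketch builds a hypothetical minimal header record with invented element names; the talk did not describe IOP's actual DTD.

    import xml.etree.ElementTree as ET

    # Hypothetical element names; purely illustrative of a "header plus PDF" record.
    article = ET.Element("article")
    ET.SubElement(article, "journal").text = "An Example Proceedings"
    ET.SubElement(article, "volume").text = "1"
    ET.SubElement(article, "year").text = "1874"
    ET.SubElement(article, "title").text = "An illustrative digitised paper"
    authors = ET.SubElement(article, "authors")
    ET.SubElement(authors, "author").text = "A. N. Other"
    references = ET.SubElement(article, "references")
    ET.SubElement(references, "reference").text = "Phil. Mag. 40 (1870) 1"
    # The full text itself stays outside the XML, as a searchable-image PDF.
    ET.SubElement(article, "fulltext", {"href": "1874_vol1_p1.pdf", "format": "pdf"})

    print(ET.tostring(article, encoding="unicode"))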

The sales and marketing strategy was to make the archive widely available, creating a new revenue stream in order to recover costs but access was free during 2002 to create brand awareness. The comprehensive sales and marketing plan used e-marketing, Web marketing, press coverage, advertising, telemarketing, direct mail and the expertise of regional managers.

The archive is defined as all content older than 10 years. In 2004, this covers the period 1874-1993. A low annual subscription charge for the archive and a site licence are offered. There is a higher charge for non-subscribers. The archive is also included in certain journal packages. Consortia are charged a fee based on the size of the consortium. Local loading and permanent access to PDFs and/or reference data can be acquired for a one-time fee. There is then no need for an annual subscription, although annual updates are optionally available. Permanent access via IOP servers is also possible for a one-time fee. XML references are optional in the local load option.

There have been significant downloads of the archive: over 200,000 full text downloads in 2003 (10% of all full text downloads). There is a long tail-off in the age of downloaded material: usage of articles from 1985 is similar to usage of those from 1965. One fifth of all electronic journal site accesses are to the archive (full text or abstracts or references). The success can be judged by the fact that one third of IOP institutional customers have taken out subscriptions or purchased the archive.

Digital Archiving at Elsevier

Joep Verheggen, Managing Director, ScienceDirect

Verheggen's presentation focused on journal content. When talking of archives, there can be confusion between (1) ongoing access to current services and (2) long-term storage and preservation of the intellectual content. Elsevier provides for both in its licences but this presentation was primarily concerned with (2).

Many university and corporate libraries have cancelled paper and use electronic only, and this is increasing weekly. E-only puts greater pressure on archival preservation, and on archiving of both the print and the electronic versions. Archiving is high on the agenda of individual libraries and library groups. Elsevier takes digital archiving seriously, citing its responsibility to authors, its responsibility for maintaining “the minutes of science”, the importance to the library community, and its interest in maintaining an asset.

Elsevier has participated in discussions, projects and committees related to digital archiving since 1995. It was among the first (after AIP) to make a public archiving commitment and perhaps the first to incorporate it in a licence. The company is currently making a multi-million dollar investment in internal back-up systems.

Since 1999, all ScienceDirect (SD) licences for online service contain an annex specifying that Elsevier will maintain a permanent archive of the SD journals it owns, will convert the archive as the technology used for storage or access changes, and will transfer the archive to an independent, librarian-approved depository if the company cannot maintain it.

There are more than 1800 Elsevier journals on ScienceDirect. Elsevier is retro-digitising, that is, creating digital backfiles, starting from volume 1 issue 1 of all titles. It expects to have more than 6 million articles on ScienceDirect by the end of this year. The original size estimate of the total file was 50 million pages, occupying 6.5 to 7 terabytes. The project was started in 2001 and will be completed in 2004.

Verheggen defined four types of archive. The internal production “archive” is an electronic warehouse, not ScienceDirect. “De facto archives” are held by about 10 regular ScienceDirect OnSite (SDOS) customers world-wide who get everything or nearly everything for local loading (but make no archiving commitment beyond their constituency). Self-designated “national” archives are held by libraries or other institutions that choose to maintain an archival copy locally as a national security measure: a variation on the SDOS licence. The “official Elsevier archive” is a formal, contractual relationship between Elsevier and a trusted archival institution to provide permanent retention and access to the digital files for future generations.

The official Elsevier archives began with an investigative project the company did with Yale University Library (with funding from the Mellon Foundation), which was completed in early 2002. Elsevier signed the first formal agreement for an official archive with the Koninklijke Bibliotheek (KB) in August 2002 and is likely to sign 3-4 additional agreements, in North America, Asia and Europe.

KB is a recognised international leader in digital archiving investigations and also the national library of The Netherlands. Elsevier was already sending KB electronic files for its 351 Dutch imprint journals. This will now expand to the entire 1,800-title journal list, which the KB will archive “forever”. The contract for the official archive is different from a normal licence for SD, reflecting the perpetual nature of an archive, and specifying a service level agreement, trigger events for public access, financial terms, the format for submission, and the comprehensiveness of the archive (e.g., handling of “withdrawn” material). As standards for archival repositories develop, the KB must meet them.

The official archives are available for walk-in users now and available remotely to anyone in the event that Elsevier exits the business and no-one else takes over. In the event of a disaster that would result in ScienceDirect being down for a prolonged period, all libraries holding the journals (archives or SDOS) would be invited to open up access to all users with no access controls.

Technical aspects of the archive are based on the LOCKSS principle (Lots of Copies Keep Stuff Safe). The hardware is a hosting system in Dayton, located in a bunker that is tornado-, earthquake-, and aircraft-impact-proof. Daily incremental backups and weekly complete backups are taken. Off-site copies of backups, and extensive recovery procedures, are in place. Migration to a new type of hardware format takes place on every new version release.
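The LOCKSS idea itself can be illustrated with a small integrity check: keep several copies, compare their checksums, and treat the majority value as authoritative so that a damaged copy can be detected and repaired from the others. This is only a sketch of the principle, not the real LOCKSS software or Elsevier's procedures.

    import hashlib
    from collections import Counter

    def sha256(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    def check_copies(copies):
        """Return (majority_digest, indexes_of_damaged_copies)."""
        digests = [sha256(c) for c in copies]
        majority, _ = Counter(digests).most_common(1)[0]
        damaged = [i for i, d in enumerate(digests) if d != majority]
        return majority, damaged

    # Three copies of the "same" article, one of which has been corrupted.
    copies = [b"article body", b"article body", b"article b0dy"]
    majority, damaged = check_copies(copies)
    print("damaged copies:", damaged)   # -> [2]; repair them from a good copy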

All software formats are generally accepted standards, developed to last or to be easy to migrate. Text is held in full SGML, to be converted to XML this year; for older content, only the “head and tail” is in SGML/XML. Full text is also held as PDF, derived from a PostScript file. Older content is of laser printer quality (from 300 dpi scanning). Images are in TIFF, JPEG, or GIF (for Web applications). Elsevier supports a small number of multimedia file formats that will be usable in coming decades.

Archives, Repositories, and the Effects of Library Time

David Seaman, Executive Director, Digital Library Federation (DLF)

DLF (http://www.diglib.org) has 33 partners, mostly US university libraries, and four allies (RLG, OCLC, CNI and LANL). Non-US members are currently being solicited: The British Library (BL) joined recently.

In the print arena, access hurts preservation and “benign neglect” is your friend. In the digital world, the reverse is true: that which will survive and will need frequent refreshing will be easily reshaped, useful, functioning well, and easily harvested. Different timescales complicate discussions across the library-publisher divide: libraries talk not of 5-10 years but of maybe more than 30 years. There is a real interest in the mass Digital Opportunity Investment Trust (DO-IT). Other initiatives are ongoing at the US Government Print Office, which has 2.2 million documents, and at Stanford. Amazon “inside the book” also uses Google.

Preservation and conservation are a constant, expensive effort. We have seen the rise of institutional repositories, and preservation initiatives such as DSpace. There is a growing awareness of the dangers of giving away our research to publishers, and libraries are refusing the “big deal”.

“Born-digital” material is also part of exciting efforts such as NDIIPP (the National Digital Information Infrastructure and Preservation Program), the Digital Curation Centre in the UK, preservation metadata work in Australia, and OCLC/RLG efforts.

Which organisations should be driving these expensive efforts? Traditionally it was thought that libraries should do it since they were less ephemeral than other organisations. Now the National Science Foundation is involved (see http://www.cise.nsf.gov/sci/reports/atkins.pdf). The NSF does not mention publishers very often. Publishers do not have chief preservation officers, although Seaman applauds Elsevier's archiving efforts. There would be lots of opportunities for partnerships if publishers were more willing to partner. Proven buzzwords are persistence, partnerships (in common metadata and forums such as NDIIPP), aggregation and malleability.

Seaman listed some solutions. The same standards and methods that fix near-term access also greatly aid archiving, e.g., XML documents. Professional and reliable cataloguing matters. Harvestable metadata must be delivered. Digital rights management (DRM), as an enabling force for scholarly enquiry, is not as punitive as it seems. We must stop thinking about single products and silos and think about RSS and relationships.
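One concrete way of “delivering harvestable metadata” is the OAI protocol (OAI-PMH) already mentioned in the CABI talk: a repository exposes simple Dublin Core records at a base URL and a harvester fetches them with ordinary HTTP requests. The sketch below builds such a request; the base URL is a placeholder rather than a real repository.

    from urllib.parse import urlencode

    # Placeholder base URL; substitute a real OAI-PMH repository to run a harvest.
    BASE_URL = "http://repository.example.org/oai"

    def list_records_url(metadata_prefix="oai_dc", resumption_token=None):
        """Build an OAI-PMH ListRecords request URL."""
        params = {"verb": "ListRecords"}
        if resumption_token:
            # Continuation requests carry only the verb and the token.
            params["resumptionToken"] = resumption_token
        else:
            params["metadataPrefix"] = metadata_prefix
        return BASE_URL + "?" + urlencode(params)

    print(list_records_url())
    # A harvester would fetch this URL, parse the returned XML records, and
    # follow any resumptionToken until the full set has been collected.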

Richard Boulderstone of the BL made some final remarks. Usage of physical libraries is falling. BL is grappling with these trends and with the growth in e-journals and the increase in disk storage world-wide. No-one is thinking enough about archiving all this material. The methods for finding scientific articles are also changing: Google is used more and A&I services are used less.

Components of potential solutions are discipline-specific collaborative environments, enhanced search services, and highly secure, certified repositories. Repositories must allow text and data deposit, have sophisticated linkages, be highly scalable, address DRM, ensure digital preservation, and be virus- and worm-resistant.

BL is strong in information management but it needs partners if it is to tackle all these issues.

