Revision as of 21:02, 18 February 2016 by KimSmilay

CLOCKSS is a not-for-profit joint venture started by libraries and publishers committed to ensuring long-term access to scholarly publications in digital format. As libraries migrate from print to online-only publications, they expect assurances from publishers that their shared investments are protected and preserved for generations to come. The CLOCKSS archive provides this assurance via its secure network of content that can be accessed only when a trigger event is deemed to have occurred. CLOCKSS is unique because it makes all content triggered from the archive freely available to the world.

Technical Overview

The following is a step-by-step overview of how CLOCKSS works.

Step One

The publisher provides the CLOCKSS system access to either presentation or source files of the content. Presentation files are the HTML pages that are normally displayed to the readers of the content. Source files are minimally formatted content used internally by the publisher.

To allow CLOCKSS crawlers to access the publisher's presentation files, the publisher needs to add to its website a CLOCKSS-provided permission statement that will tell the crawlers what content is available for collection.

CLOCKSS howitworks1.png
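
As a rough illustration of the permission check described above, the sketch below tests whether a fetched page contains a permission statement. It is not the CLOCKSS crawler's actual logic: the `PERMISSION_STATEMENT` text is a hypothetical placeholder (the real wording is supplied by CLOCKSS), and the function name `has_permission` is invented for this example.

```python
# Hypothetical placeholder text; the real statement is provided by CLOCKSS.
PERMISSION_STATEMENT = "EXAMPLE: CLOCKSS permission statement text"


def _normalize(text: str) -> str:
    """Collapse runs of whitespace and ignore case so line wrapping does not matter."""
    return " ".join(text.split()).casefold()


def has_permission(page_html: str, statement: str = PERMISSION_STATEMENT) -> bool:
    """Return True if the page contains the permission statement.

    Assumption: a plain substring match is enough for illustration.
    """
    return _normalize(statement) in _normalize(page_html)
```

A crawler built along these lines would collect content only from sites where `has_permission` returns `True` for the publisher's permission page.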

To allow CLOCKSS access to the publisher's source files, the publisher needs to place them on a designated FTP site.

CLOCKSS howitworks2.png

Step Two

Special CLOCKSS boxes located at Rice, Indiana, and Stanford Universities ingest the content the publisher made available.

CLOCKSS howitworks3.png

Step Three

The content in each CLOCKSS box must go through a verification process to confirm that every box's version of the content is identical to the others. This establishes the authoritative version of the content.

CLOCKSS howitworks4.png
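
The verification step can be pictured as comparing a fingerprint of each box's copy. The sketch below is only an assumption about how such a comparison might look: the text does not say how the boxes compare content, so SHA-256 hashing and the names `digest` and `copies_agree` are illustrative choices.

```python
import hashlib
from typing import Mapping


def digest(content: bytes) -> str:
    """Return a SHA-256 fingerprint of one box's copy of the content."""
    return hashlib.sha256(content).hexdigest()


def copies_agree(copies: Mapping[str, bytes]) -> bool:
    """Return True when every box holds a byte-identical copy.

    `copies` maps a (hypothetical) box name to the bytes that box ingested.
    An empty mapping yields False, since there is nothing to agree on.
    """
    return len({digest(data) for data in copies.values()}) == 1
```

Comparing fingerprints rather than full copies means the boxes only need to exchange short hash values to detect a mismatch.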

Step Four

The majority of the CLOCKSS boxes are preservation machines, performing the main storage and audit functions. After the content on the ingest machines has been validated, the preservation CLOCKSS boxes collect it from them.

CLOCKSS howitworks5.png

Step Five

The content is then preserved through a system of audit and repair. The CLOCKSS boxes continually communicate over the Internet to audit the content they are preserving. If the content in one CLOCKSS box is damaged or incomplete, that box receives repairs drawn from other CLOCKSS boxes' holdings, from the publisher's original presentation files, or both. This cooperation between the CLOCKSS boxes avoids the need to back them up individually. It also provides unambiguous reassurance that the system is performing its function and that the correct content is always available.

CLOCKSS howitworks6.png
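
The audit-and-repair cycle described above can be sketched as a simple majority vote. This is a simplified model under stated assumptions, not the protocol the CLOCKSS boxes actually run: it assumes the copy held by a strict majority of boxes is the correct one, it does not model repair from the publisher's files, and the function name `audit_and_repair` and the box names are hypothetical.

```python
import hashlib
from collections import Counter
from typing import Dict, List


def audit_and_repair(holdings: Dict[str, bytes]) -> List[str]:
    """Overwrite minority copies with the majority copy, in place.

    `holdings` maps a box name to that box's copy of one piece of content.
    Returns the names of the boxes that were repaired.
    """
    if not holdings:
        return []
    digests = {box: hashlib.sha256(data).hexdigest() for box, data in holdings.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes <= len(holdings) // 2:
        # Without a strict majority there is no trustworthy copy to repair from.
        raise ValueError("no majority copy to repair from")
    good_copy = next(data for box, data in holdings.items() if digests[box] == winner)
    repaired = []
    for box in list(holdings):
        if digests[box] != winner:
            holdings[box] = good_copy  # replace the damaged copy
            repaired.append(box)
    return repaired
```

For example, given three boxes where one holds a corrupted copy, the function replaces that copy with the one the other two agree on and reports which box was repaired.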

Step Six

When a trigger event occurs and the CLOCKSS Board decides to release the content from the archive, two things happen: (a) the content is automatically migrated to the newest format, and (b) the content is copied from the CLOCKSS boxes to a publicly available web server at a CLOCKSS host organization (currently the EDINA Data Center at the University of Edinburgh, and Stanford University).

CLOCKSS howitworks7.png

Step Seven

The released content is now freely available from Stanford University and EDINA at the University of Edinburgh. It is also directly available via OpenURLs through CrossRef (a) or local library link-resolvers (b), or from this list.

Figure a

CLOCKSS howitworks 7a.png

Figure b

CLOCKSS howitworks 7b.png
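
To make the OpenURL route concrete, the sketch below assembles an OpenURL-style query string for a journal article. The resolver address `RESOLVER_BASE` is a hypothetical placeholder, not a real CrossRef or library endpoint, and the function name `build_openurl` is invented here. The query keys follow the common OpenURL 0.1 key names; which keys a given resolver accepts is an assumption to check against that resolver.

```python
from urllib.parse import urlencode

# Hypothetical resolver address used only for illustration.
RESOLVER_BASE = "https://resolver.example.org/openurl"


def build_openurl(issn: str, volume: str, issue: str, start_page: str,
                  year: str, base: str = RESOLVER_BASE) -> str:
    """Build an OpenURL-style link identifying one journal article."""
    params = {
        "genre": "article",
        "issn": issn,
        "volume": volume,
        "issue": issue,
        "spage": start_page,  # first page of the article
        "date": year,
    }
    return base + "?" + urlencode(params)
```

A link resolver that receives such a URL can use the citation details to direct the reader to wherever the article is currently hosted.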