METS/ALTO Introduction
A brief introduction to METS and ALTO standards. Learn what METS and ALTO files are used for, how they are structured and how they relate.
Why would I use METS and ALTO?
- “I digitize lots of different items and each type is digitized to a different format. Some are Word, some are PDF, some are XML, some are just JPG – I need it all to come in the same structure, but what is the best way?”
- “I’d like to offer full-text search across the complete collection to scientists and researchers, so I need to build a text index of the complete collection – and I might want to change the presentation system in three years, so I need the source data in a standard, non-proprietary format.”
- “I need to organize my digitization project according to long-term preservation standards – how long will JPG exist? I need something more robust and safe for the next 100 years.”
What is METS?
- METS -> Metadata Encoding and Transmission Standard
- Established in 2001
- XML-based open standard
- Schema is hosted at the Library of Congress (LOC)
- Maintained by the METS Editorial Board
- Current version: 1.12. Version 2 is in preparation
- Used for long-term preservation
- https://www.loc.gov/standards/mets/
What does a METS file look like?
A METS XML file begins with the METS Header, followed by five further sections.
1. METS Header <metsHdr>
The METS Header contains technical information about the METS document itself, such as its creator and editor.
2. Descriptive Metadata <dmdSec>
Typically, MODS (https://www.loc.gov/standards/MODS) or a similar metadata schema is used to describe the object itself. This section contains information such as title, author, publisher, and publication date.
3. Administrative Metadata <amdSec>
Contains information about the image capture process, such as the scanning hardware and software used, file type, resolution, compression, and date of image capture. Typically, MIX (https://www.loc.gov/standards/mix/) or a similar metadata schema is used.
4. File Section <fileSec>
Lists, describes, and links to all files that belong to the digital object described by the METS file.
For a typical printed object (book, periodical, newspaper), one image file (TIFF, JPEG, or JPEG 2000) and one ALTO XML file would be linked here per page.
Issue-level PDF or EPUB files could be linked here too.
5. Physical Structure <structMap LABEL="Physical Structure">
For a typical printed object (book, periodical, newspaper), the physical pages are listed here with their page numbers and links to the page-level files specified in the File Section.
6. Logical Structure <structMap LABEL="Logical Structure">
For a book, this section would typically contain the table of contents, in which the logical sections of the book are linked to pages.
For a more complex object like a newspaper, the structure might be more deeply nested, for example to describe one article that contains various elements (title, images with captions, text blocks) spread over multiple physical pages.
For more details see https://www.loc.gov/standards/mets/METSOverview.v3_en.html
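To make the section list above more concrete, here is a minimal Python sketch that parses a hand-written, heavily shortened METS document and reads the title from the descriptive metadata and the file list from the File Section. The sample document, its IDs, file paths, and USE values are invented for illustration and have not been validated against the schema; only the element names and namespaces are taken from METS and MODS.

```python
import xml.etree.ElementTree as ET

NS = {
    "mets": "http://www.loc.gov/METS/",
    "mods": "http://www.loc.gov/mods/v3",
}
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical, shortened METS document for a one-page object.
METS_XML = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:mods="http://www.loc.gov/mods/v3"
           xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:metsHdr CREATEDATE="2024-01-01T00:00:00">
    <mets:agent ROLE="CREATOR"><mets:name>Example Library</mets:name></mets:agent>
  </mets:metsHdr>
  <mets:dmdSec ID="DMD1">
    <mets:mdWrap MDTYPE="MODS">
      <mets:xmlData>
        <mods:mods>
          <mods:titleInfo><mods:title>Example Book</mods:title></mods:titleInfo>
        </mods:mods>
      </mets:xmlData>
    </mets:mdWrap>
  </mets:dmdSec>
  <mets:fileSec>
    <mets:fileGrp USE="IMAGE">
      <mets:file ID="IMG0001" MIMETYPE="image/tiff">
        <mets:FLocat LOCTYPE="URL" xlink:href="images/0001.tif"/>
      </mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="ALTO">
      <mets:file ID="ALTO0001" MIMETYPE="text/xml">
        <mets:FLocat LOCTYPE="URL" xlink:href="alto/0001.xml"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap LABEL="Physical Structure">
    <mets:div TYPE="page" ORDER="1" ORDERLABEL="1">
      <mets:fptr FILEID="IMG0001"/>
      <mets:fptr FILEID="ALTO0001"/>
    </mets:div>
  </mets:structMap>
</mets:mets>"""


def summarize_mets(xml_text):
    """Return (title, {file ID: location}) for a METS document."""
    root = ET.fromstring(xml_text)
    # The title lives in the MODS record wrapped inside dmdSec.
    title = root.findtext(".//mods:title", namespaces=NS)
    # Every file of the digital object is listed in fileSec.
    files = {}
    for file_el in root.iterfind("mets:fileSec/mets:fileGrp/mets:file", NS):
        locat = file_el.find("mets:FLocat", NS)
        files[file_el.get("ID")] = locat.get(XLINK_HREF)
    return title, files


title, files = summarize_mets(METS_XML)
print(title)
for file_id, href in files.items():
    print(file_id, href)
```

A real METS file would usually also carry an amdSec and a logical structMap; they are left out here only to keep the sample short.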
What is ALTO?
- ALTO -> Analyzed Layout and Text Object
- XML-based open standard
- Schema is hosted at the Library of Congress (LOC)
- Maintained by the ALTO Board
- Current version: 4.4
- https://www.loc.gov/standards/alto/
What does ALTO do?
- Contains the content of a single page
- Describes the layout of a printed page in enough detail to rebuild the original page
- Describes style, layout, and block type information
- May contain tags that carry additional information about the content (e.g. named entities)
What does an ALTO file look like?
ALTO XML provides technical metadata describing the layout and content of physical text resources, such as pages of a book or a newspaper. ALTO files usually have three sections:
1. Description
This section contains technical information relating to the ALTO file, such as the definition of the measurement unit used and information on the OCR software used.
2. Styles
This section collects information about the layout styles of the page described. Typical information includes font family, style, and size, as well as paragraph spacing and alignment.
3. Layout
The core content of the ALTO file is contained here. All objects (words, lines, text blocks, pictures, tables) of the page are listed here with coordinates and, where applicable, the OCR transcription. The structure can be flat or more complex depending on the type of material.
Later versions of the ALTO schema support handwritten material and allow multiple OCR results to be recorded with probability values.
For more details see https://www.loc.gov/standards/alto/techcenter/structure.html
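The three sections can be illustrated with a minimal Python sketch that parses a small, hand-written ALTO v4 document and lists each word with its coordinates. The sample page, its IDs, and all values are invented for illustration; the element and attribute names (Description, Styles, Layout, String, CONTENT, HPOS, VPOS, WIDTH, HEIGHT, WC) come from the ALTO schema. Files written against other ALTO versions use a different namespace URI, which this sketch does not handle.

```python
import xml.etree.ElementTree as ET

# Namespace of ALTO version 4; older versions use other namespace URIs.
NS = {"alto": "http://www.loc.gov/standards/alto/ns-v4#"}

# Hypothetical one-line page.
ALTO_XML = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Description>
    <MeasurementUnit>pixel</MeasurementUnit>
  </Description>
  <Styles>
    <TextStyle ID="TS1" FONTFAMILY="Times" FONTSIZE="10"/>
  </Styles>
  <Layout>
    <Page ID="P1" PHYSICAL_IMG_NR="1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace>
        <TextBlock ID="TB1" HPOS="100" VPOS="100" WIDTH="600" HEIGHT="40">
          <TextLine ID="TL1" HPOS="100" VPOS="100" WIDTH="600" HEIGHT="40">
            <String ID="S1" CONTENT="Hello" HPOS="100" VPOS="100"
                    WIDTH="250" HEIGHT="40" WC="0.98"/>
            <SP HPOS="350" VPOS="100" WIDTH="20"/>
            <String ID="S2" CONTENT="world" HPOS="370" VPOS="100"
                    WIDTH="260" HEIGHT="40" WC="0.91"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def read_words(xml_text):
    """Return (measurement unit, list of word dicts) for an ALTO page."""
    root = ET.fromstring(xml_text)
    # Description: tells us how to interpret the coordinates.
    unit = root.findtext("alto:Description/alto:MeasurementUnit", namespaces=NS)
    words = []
    # Layout: each String element is one recognized word.
    for string_el in root.iterfind(".//alto:String", NS):
        box = tuple(
            float(string_el.get(name))
            for name in ("HPOS", "VPOS", "WIDTH", "HEIGHT")
        )
        wc = string_el.get("WC")  # word confidence, optional
        words.append({
            "text": string_el.get("CONTENT"),
            "box": box,
            "confidence": float(wc) if wc is not None else None,
        })
    return unit, words


unit, words = read_words(ALTO_XML)
print("unit:", unit)
for word in words:
    print(word["text"], word["box"], word["confidence"])
```

Because every word carries its own bounding box, a viewer can highlight search hits directly on the page image.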
How METS and ALTO work together
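As described in the File Section and Physical Structure sections, the METS file links each physical page to its image file and its ALTO file. The Python sketch below follows those links for a hand-written two-page example. The IDs, paths, the USE values "IMAGE" and "ALTO", and the TYPE values "book" and "page" are assumptions made for this illustration; real projects define their own conventions in a METS profile.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
NS = {"mets": METS_NS}
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical METS fragment: only fileSec and the physical structMap.
METS_XML = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="IMAGE">
      <mets:file ID="IMG0001"><mets:FLocat LOCTYPE="URL" xlink:href="images/0001.tif"/></mets:file>
      <mets:file ID="IMG0002"><mets:FLocat LOCTYPE="URL" xlink:href="images/0002.tif"/></mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="ALTO">
      <mets:file ID="ALTO0001"><mets:FLocat LOCTYPE="URL" xlink:href="alto/0001.xml"/></mets:file>
      <mets:file ID="ALTO0002"><mets:FLocat LOCTYPE="URL" xlink:href="alto/0002.xml"/></mets:file>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap LABEL="Physical Structure">
    <mets:div TYPE="book">
      <mets:div TYPE="page" ORDER="1" ORDERLABEL="i">
        <mets:fptr FILEID="IMG0001"/>
        <mets:fptr FILEID="ALTO0001"/>
      </mets:div>
      <mets:div TYPE="page" ORDER="2" ORDERLABEL="1">
        <mets:fptr FILEID="IMG0002"/>
        <mets:fptr FILEID="ALTO0002"/>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>"""


def pages_with_files(xml_text):
    """Return one dict per physical page with its image and ALTO locations."""
    root = ET.fromstring(xml_text)

    # Step 1: index the File Section as file ID -> (group USE, location).
    files = {}
    for group in root.iterfind("mets:fileSec/mets:fileGrp", NS):
        for file_el in group.iterfind("mets:file", NS):
            href = file_el.find("mets:FLocat", NS).get(XLINK_HREF)
            files[file_el.get("ID")] = (group.get("USE"), href)

    # Step 2: pick the structMap labelled as the physical structure.
    physical = next(
        sm for sm in root.iterfind("mets:structMap", NS)
        if sm.get("LABEL") == "Physical Structure"
    )

    # Step 3: resolve the file pointers of every page div.
    pages = []
    for div in physical.iter("{%s}div" % METS_NS):
        if div.get("TYPE") != "page":
            continue
        page = {"order": int(div.get("ORDER")), "label": div.get("ORDERLABEL")}
        for fptr in div.iterfind("mets:fptr", NS):
            use, href = files[fptr.get("FILEID")]
            page[use] = href
        pages.append(page)
    return sorted(pages, key=lambda p: p["order"])


for page in pages_with_files(METS_XML):
    print(page["order"], page["label"], page["IMAGE"], page["ALTO"])
```

In this arrangement METS describes the object as a whole, while each linked ALTO file holds the text and layout of exactly one page.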
Summary
Why would you use XML standards?
- Fully documented XML format
- Can be used by any IT provider now and in the future
- Can be transformed to other formats in the future (for long-term preservation)
- Readable by humans
What are the benefits of using METS/ALTO?
- Open Standard
- Free to use for everyone
- It is the industry standard for digitization used by hundreds of libraries and content providers
- The long-term sustainability of your digital objects is greatly enhanced
- Supports article and chapter segmentation
- You can handle objects easily and exchange them with other parties
- You can create PDF, EPUB, DAISY and other formats from METS/ALTO
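As a small illustration of deriving another format, the Python sketch below turns an ALTO page into plain text, which is also a typical first step when building a full-text index. It is a simplification under stated assumptions: the sample page is invented, coordinates are omitted for brevity (so the sample would not pass schema validation), and hyphenation and reading order are ignored.

```python
import xml.etree.ElementTree as ET

NS = {"alto": "http://www.loc.gov/standards/alto/ns-v4#"}

# Hypothetical page with a heading block and a two-line paragraph block.
ALTO_XML = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="METS"/><SP/><String CONTENT="and"/><SP/><String CONTENT="ALTO"/>
          </TextLine>
        </TextBlock>
        <TextBlock ID="TB2">
          <TextLine>
            <String CONTENT="Open"/><SP/><String CONTENT="standards"/><SP/><String CONTENT="for"/>
          </TextLine>
          <TextLine>
            <String CONTENT="digitized"/><SP/><String CONTENT="material"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def alto_to_text(xml_text):
    """Join words into lines and lines into blocks separated by blank lines."""
    root = ET.fromstring(xml_text)
    blocks = []
    for block in root.iterfind(".//alto:TextBlock", NS):
        lines = []
        for line in block.iterfind("alto:TextLine", NS):
            words = [s.get("CONTENT", "") for s in line.iterfind("alto:String", NS)]
            lines.append(" ".join(words))
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


print(alto_to_text(ALTO_XML))
```

Richer targets such as PDF or EPUB would additionally use the coordinates from ALTO and the structure from METS.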