User Manual of Co-Reader

This page distributes ground-truthed datasets for use in document analysis and recognition experiments.

Dataset for table recognition

Description

In total, 2000 pages in PDF format were collected, and the corresponding ground truths were extracted using our semi-automatic ground-truthing tool "Marmot".
The dataset is composed of Chinese and English pages in a ratio of about 1:1.

The Chinese pages were selected from over 120 e-Books covering diverse subject areas, provided by the Founder Apabi library, and no more than 15 pages were selected from each book.
The English pages were crawled from the CiteSeer website.

The pages show great variety in language, page layout, and table style. Over 1500 conference and journal papers were crawled, covering various fields and spanning publications from 1970 to 2011.
The e-Book pages are mostly in one-column layout, while the English pages are mixed with both one-column and two-column layouts.

Download

Marmot Dataset v1.0
Table Detection Evaluator v1.0

Dataset for math formula recognition

Description

This is a ground-truth dataset and evaluation tool for mathematical formula identification. The documents were collected by crawling PDF files from CiteSeerX.
In total, the dataset contains 400 document pages with 1575 isolated formulas and 7907 embedded formulas, selected from 194 digitally originated PDF documents.
The dataset includes not only the digitally originated PDF files but also their corresponding document images. Document metadata is included as well.

The ground truth of mathematical formulas in each document page includes the precise bounding boxes of the isolated/embedded formulas.
It also includes the objects (characters, graphics, and images) in each isolated/embedded formula. For each object, a bounding box is provided. For character objects, the character's Unicode value and font size are provided, too.
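
The structure described above can be pictured as a small set of nested records. The following is only a minimal sketch in Python: the class names, field names, and the coordinate convention (left < right, top < bottom) are illustrative assumptions, not the dataset's actual file format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BBox:
    # Assumed convention: left < right and top < bottom (image-style coordinates).
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, other: "BBox") -> bool:
        """Return True if `other` lies entirely inside this box."""
        return (self.left <= other.left and self.top <= other.top
                and self.right >= other.right and self.bottom >= other.bottom)


@dataclass
class FormulaObject:
    kind: str                           # "character", "graphic", or "image"
    bbox: BBox
    unicode_char: Optional[str] = None  # only meaningful for characters
    font_size: Optional[float] = None   # only meaningful for characters


@dataclass
class Formula:
    kind: str                           # "isolated" or "embedded"
    bbox: BBox
    objects: List[FormulaObject] = field(default_factory=list)


def objects_inside(formula: Formula) -> bool:
    """Sanity check: every object box should fall within the formula box."""
    return all(formula.bbox.contains(obj.bbox) for obj in formula.objects)
```

A record built this way keeps the per-object details (bounding box, and Unicode value and font size for characters) attached to the formula they belong to, which makes simple consistency checks such as `objects_inside` easy to write.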

An evaluation tool is also provided; it operates on the ground-truth format defined for this dataset.
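
This page does not describe the matching criterion the evaluator uses. As a rough illustration of how detected formula boxes can be scored against ground-truth boxes, the sketch below uses intersection over union (IoU) with greedy one-to-one matching, a common choice in detection evaluation. The function names and the 0.5 threshold are assumptions for illustration and may differ from what the provided evaluator does.

```python
def area(box):
    """Area of a (left, top, right, bottom) box; zero if degenerate."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a, b):
    """Intersection over union of two (left, top, right, bottom) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(detections, ground_truth, threshold=0.5):
    """Greedily match each detection to its best unmatched ground-truth box."""
    matched = set()
    true_positives = 0
    for det in detections:
        best_index, best_score = None, 0.0
        for i, gt in enumerate(ground_truth):
            if i in matched:
                continue
            score = iou(det, gt)
            if score > best_score:
                best_index, best_score = i, score
        if best_index is not None and best_score >= threshold:
            matched.add(best_index)
            true_positives += 1
    precision = true_positives / len(detections) if detections else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall
```

For results that are meant to be compared with other work on this dataset, the provided evaluator should be used rather than a sketch like this one.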

This dataset is a public database that is freely usable for research purposes.

Download

Marmot Math Dataset v1.0
Math Formula Detection Evaluator v1.0

Dataset for math formula identification in Chinese documents

Description

This is a ground-truth dataset for mathematical formula identification in Chinese documents.
In total, the dataset contains 200 document pages with 1166 isolated formulas and 3022 embedded formulas, selected from 24 digitally originated CEB documents.

The ground truth of mathematical formulas in each document page includes the precise bounding boxes of the isolated/embedded formulas.
It also includes the objects (characters, graphics, and images) in each isolated/embedded formula. For each object, a bounding box is provided. For character objects, the character's Unicode value and font size are provided, too.

This dataset is a public database that is freely usable for research purposes.

Download

Marmot Chinese Math Dataset v1.0

Dataset for layout analysis of fixed layout documents

Description

This is a ground-truth dataset for layout analysis of fixed-layout documents.
In total, the dataset contains 244 pages selected from 35 Portable Document Format (PDF) documents. Primitive objects of page content include text, images and graphics. Primitives are further grouped into "fragments", which contain proximate primitives of the same type. For example, text fragments are usually text lines. Currently, logical labels are assigned to fragments. Labels include body text, title, figure, figure annotation, figure caption, figure caption continuation, list item, list item continuation, table cell, table caption, equation, page number, footer, header, footnote, and marginal note.
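
When working with these labels in code, it can help to fix them in one place. The sketch below lists the label names given above as a Python enumeration and counts labels over a list of fragments. The `Fragment` record and its field names are hypothetical and do not reflect the dataset's actual file format.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class LogicalLabel(Enum):
    BODY_TEXT = "body text"
    TITLE = "title"
    FIGURE = "figure"
    FIGURE_ANNOTATION = "figure annotation"
    FIGURE_CAPTION = "figure caption"
    FIGURE_CAPTION_CONTINUATION = "figure caption continuation"
    LIST_ITEM = "list item"
    LIST_ITEM_CONTINUATION = "list item continuation"
    TABLE_CELL = "table cell"
    TABLE_CAPTION = "table caption"
    EQUATION = "equation"
    PAGE_NUMBER = "page number"
    FOOTER = "footer"
    HEADER = "header"
    FOOTNOTE = "footnote"
    MARGINAL_NOTE = "marginal note"


@dataclass
class Fragment:
    # Hypothetical record: a group of proximate primitives of one type.
    primitive_type: str   # "text", "image", or "graphic"
    label: LogicalLabel


def label_histogram(fragments):
    """Count how many fragments carry each logical label."""
    return Counter(fragment.label for fragment in fragments)
```

Looking a label up by its text, for example `LogicalLabel("table cell")`, raises `ValueError` for any name outside this list, which catches misspelled labels early.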
This dataset is a public database that is freely usable for research purposes.

Download

Layout Analysis Dataset v0.1

Dataset for ICDAR 2017 POD Competition

Description

The competition dataset consists of 2000 English document page images selected from 1500 scientific papers from CiteSeer. The dataset shows good variety in both page layout styles and object styles, including single-column pages, two-column pages, multi-column pages and various kinds of formulas, tables, graphics and figures. In the dataset, each page image is accompanied by an XML file containing its ground truth, which describes the three kinds of objects to be detected: formulas, tables, and figures or images (including charts). More details can be found on the competition homepage ( https://cndplab-founder.github.io/ICDAR2017_POD/index.html ).
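
As a rough illustration of reading such a per-page ground-truth file, the sketch below parses an inline XML sample with Python's standard `xml.etree.ElementTree` module and reduces each region to an axis-aligned bounding box. The element names (`formulaRegion`, `tableRegion`, `figureRegion`, `Coords`), the `points` attribute, and the sample file name are assumptions made for this example; the competition homepage is the authority on the actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real schema may differ.
SAMPLE_XML = """
<document filename="page_0001.bmp">
  <formulaRegion><Coords points="10,20 110,20 10,60 110,60"/></formulaRegion>
  <tableRegion><Coords points="50,200 400,200 50,350 400,350"/></tableRegion>
  <figureRegion><Coords points="30,400 300,400 30,600 300,600"/></figureRegion>
</document>
"""

# Assumed mapping from element name to object kind.
REGION_TAGS = {
    "formulaRegion": "formula",
    "tableRegion": "table",
    "figureRegion": "figure",
}


def parse_regions(xml_text):
    """Return a list of (kind, (left, top, right, bottom)) tuples."""
    root = ET.fromstring(xml_text.strip())
    regions = []
    for tag, kind in REGION_TAGS.items():
        for region in root.iter(tag):
            coords = region.find("Coords")
            if coords is None:
                continue
            # Each point is "x,y"; points are separated by whitespace.
            points = [tuple(int(v) for v in p.split(","))
                      for p in coords.get("points", "").split()]
            if not points:
                continue
            xs = [x for x, _ in points]
            ys = [y for _, y in points]
            regions.append((kind, (min(xs), min(ys), max(xs), max(ys))))
    return regions
```

Regions are returned grouped by kind in the order of `REGION_TAGS`, not in document order, which is usually sufficient for per-class evaluation.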

This dataset is a public database that is freely usable for research purposes.

If you have results to report on this dataset, please send an email to gaoliangcai@pku.edu.cn.

Please also cite the version number of the dataset you used, in order to facilitate comparison of results. Many thanks for your cooperation!

Copyright (c) 2011 by Institute of Computer Science and Technology of Peking University and Institute of Digital Publishing of Founder R&D Center, China.

Permission is granted, free of charge, to any person or group obtaining a copy of the dataset and evaluator source code with research motivation only, including without limitation the rights to use, copy, modify, and distribute all the files.