/////////////////////////////////////////////////////////////////////////////////
//            Ground truth for Mathematical formula identification             //
//                       (First Distribution, 13/06/2013)                      //
/////////////////////////////////////////////////////////////////////////////////

1. Overview
   This is a ground-truth dataset for mathematical formula identification in Chinese documents. 
   In total, the dataset contains 200 document pages with 1167 isolated formulas and 3056 embedded formulas, selected from 24 digitally originated CEB documents.

   The ground truth for each document page includes the precise bounding boxes of the isolated and embedded formulas, as well as the objects (characters, graphics, and images) within each formula. A bounding box is provided for every object. For character objects, the character's Unicode value and font size are also provided.

   This dataset is a public database that is freely usable for research purposes.

---------------------------------------------------------------------------------
2. Organization
   The folder structure after extraction is as follows:

   Dataset
    ground truth
    CEB
  
     - The ground truth of each document page is stored in the "ground truth" folder. Its format is described in the following section.
     - The born-digital CEB files are stored in the "CEB" folder.

---------------------------------------------------------------------------------
3. Ground truth format
   1) Page
   The ground truth for each document page is stored within the <page> </page> tag pair. The page number and the bounding box of the page are stored in the attributes "PageNum" and "BBox".

   2) Isolated formula
   Each isolated mathematical formula is stored within an <IsolatedFormula> </IsolatedFormula> tag pair. Its bounding box is stored in the attribute named "BBox".
   The objects in the isolated formula are represented as children of the <IsolatedFormula> tag pair. The objects parsed from the PDF documents are characters, graphics, and images, represented by the <Char> </Char>, <Path> </Path>, and <Image> </Image> tag pairs respectively. Each object's bounding box is stored in the "BBox" attribute of its tag pair. For character objects, the Unicode value and font size are stored in the "Text" and "FSize" attributes of the <Char> tag pair.
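
   As an illustration of the structure described above, the following is a minimal Python sketch that reads one page and lists its isolated formulas and their objects. The tag and attribute names come from this section; the sample XML, its placeholder attribute values, and the function name summarize_page are assumptions made only for the example. Attribute values are kept as raw strings because their exact encoding is not assumed here.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page. The attribute values are placeholders,
# not real data from the dataset.
SAMPLE = """<page PageNum="1" BBox="page-box">
  <IsolatedFormula BBox="formula-box">
    <Char BBox="char-box" Text="x" FSize="10.5"/>
    <Path BBox="path-box"/>
    <Image BBox="image-box"/>
  </IsolatedFormula>
</page>"""

OBJECT_TAGS = ("Char", "Path", "Image")


def summarize_page(xml_text):
    """Return a list of pages, each with its isolated formulas and objects."""
    root = ET.fromstring(xml_text)
    pages = []
    # iter() also yields the root itself, so this works whether or not
    # <page> is the outermost element of the file.
    for page in root.iter("page"):
        formulas = []
        for formula in page.iter("IsolatedFormula"):
            objects = [
                {
                    "kind": child.tag,
                    "bbox": child.get("BBox"),
                    "text": child.get("Text"),    # None for Path and Image
                    "fsize": child.get("FSize"),  # None for Path and Image
                }
                for child in formula
                if child.tag in OBJECT_TAGS
            ]
            formulas.append({"bbox": formula.get("BBox"), "objects": objects})
        pages.append(
            {
                "page_num": page.get("PageNum"),
                "bbox": page.get("BBox"),
                "isolated": formulas,
            }
        )
    return pages


for page in summarize_page(SAMPLE):
    for formula in page["isolated"]:
        kinds = [obj["kind"] for obj in formula["objects"]]
        print(page["page_num"], formula["bbox"], kinds)
```

   To read a real file, the same function could be given the file's contents as a string.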

   3) Embedded formula
   The ground truth format of the embedded formula is similar to that of the isolated formula.

   4) Bounding box
   Every bounding box in the ground truth is a precise bounding box, represented by the coordinates of its top-left corner and its bottom-right corner. To preserve the precision of the floating-point values, the coordinates are written as hexadecimal digits. The origin of the coordinate space is the bottom-left corner of the document page, and the unit of length is the point (1/72 inch).
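
   The exact hexadecimal layout is not specified in this file. The sketch below assumes that a "BBox" value holds four whitespace-separated tokens in the order left, top, right, bottom, and that each token is 16 hexadecimal digits encoding a big-endian IEEE 754 double. Both points are assumptions to be checked against the actual files, and the function names are hypothetical. The last helper shows how the bottom-left origin in points might be converted to a top-left origin in pixels for use with a rendered page image.

```python
import struct


def decode_hex_double(token):
    """Decode one coordinate.

    Assumption: 16 hex digits holding a big-endian IEEE 754 double.
    """
    if len(token) != 16:
        raise ValueError(f"expected 16 hex digits, got {token!r}")
    return struct.unpack(">d", bytes.fromhex(token))[0]


def parse_bbox(value):
    """Parse a BBox attribute into (left, top, right, bottom) in points.

    Assumption: four whitespace-separated tokens, top-left corner first.
    """
    tokens = value.split()
    if len(tokens) != 4:
        raise ValueError(f"expected 4 coordinates, got {len(tokens)}")
    left, top, right, bottom = (decode_hex_double(t) for t in tokens)
    return left, top, right, bottom


def to_top_left_pixels(bbox, page_height_pt, dpi=72):
    """Convert a bottom-left-origin box in points to top-left-origin pixels."""
    left, top, right, bottom = bbox
    scale = dpi / 72.0  # 1 point = 1/72 inch
    return (
        left * scale,
        (page_height_pt - top) * scale,     # flip the y axis
        right * scale,
        (page_height_pt - bottom) * scale,
    )


# Build a sample value from known numbers instead of quoting real data.
sample = " ".join(struct.pack(">d", v).hex() for v in (72.0, 720.0, 144.0, 700.0))
box = parse_bbox(sample)
print(box)
print(to_top_left_pixels(box, page_height_pt=792.0, dpi=144))
```

   Because the origin is the bottom-left corner, the top edge of a box has a larger y value than its bottom edge until the axis is flipped.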

---------------------------------------------------------------------------------
4. Copyright
   This dataset is for research use only. Any redistribution or commercial use requires individual permission from the copyright owner.

---------------------------------------------------------------------------------

