Convert PDF file to XML format

Maximum file size: 20 MB.



Output Data (see note 1)
Formatting
Page Origin (see note 2)

Notes

  1. "Runs of Text" are strings on a single line that share common display properties: font, font size, bold, italic, rotation and color. They are useful for composing formatted output (e.g. rendering with HTML/CSS). A run may split a word if one of the letters in the word has different properties. A typical run of text looks as follows:

       <txtRn>
         <leftX>323.999</leftX>
         <bottomY>292.9298</bottomY>
         <rightX>557.9752</rightX>
         <topY>283.9631</topY>
         <baseLineY>290.7679</baseLineY>
         <fontId>2</fontId>
         <text>scanned document pages is a process of partitioning a docu-</text>
       </txtRn>

     Here leftX, bottomY, rightX and topY define a box that encloses the text, and baseLineY is the font baseline. Rotated runs of text also have a rotation field that gives the angle, measured anti-clockwise in radians, and the point of rotation:

       <rotation>
         <pivotX>119.4425</pivotX>
         <pivotY>24.96</pivotY>
         <angle>-0.9147</angle>
       </rotation>
  2. Most applications use the top-left corner of the page as the (0,0) origin, with Y increasing down the page. The PDF specification, however, uses the bottom-left corner, with Y increasing up the page.
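
The two notes above can be illustrated with a short Python sketch that reads one run of text and converts a Y coordinate between the two origin conventions. This is only an illustration under stated assumptions: the names TextRun, parse_text_run and flip_y are hypothetical, the rotation element is assumed to be a child of txtRn, and the page height of 792 points (US Letter) is an assumed value that should be replaced by the real page height.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional, Tuple

# Sample run of text copied from note 1.
SAMPLE = """<txtRn>
  <leftX>323.999</leftX>
  <bottomY>292.9298</bottomY>
  <rightX>557.9752</rightX>
  <topY>283.9631</topY>
  <baseLineY>290.7679</baseLineY>
  <fontId>2</fontId>
  <text>scanned document pages is a process of partitioning a docu-</text>
</txtRn>"""


@dataclass
class TextRun:
    """Hypothetical container for one <txtRn> element."""
    left_x: float
    bottom_y: float
    right_x: float
    top_y: float
    baseline_y: float
    font_id: int
    text: str
    # (pivot_x, pivot_y, angle in radians, anti-clockwise), or None if unrotated.
    rotation: Optional[Tuple[float, float, float]] = None


def parse_text_run(elem: ET.Element) -> TextRun:
    """Build a TextRun from a <txtRn> element with the fields shown in note 1."""
    def num(parent: ET.Element, tag: str) -> float:
        return float(parent.findtext(tag))

    # Assumption: when present, <rotation> is nested directly inside <txtRn>.
    rot = elem.find("rotation")
    rotation = None
    if rot is not None:
        rotation = (num(rot, "pivotX"), num(rot, "pivotY"), num(rot, "angle"))

    return TextRun(
        left_x=num(elem, "leftX"),
        bottom_y=num(elem, "bottomY"),
        right_x=num(elem, "rightX"),
        top_y=num(elem, "topY"),
        baseline_y=num(elem, "baseLineY"),
        font_id=int(elem.findtext("fontId")),
        text=elem.findtext("text") or "",
        rotation=rotation,
    )


def flip_y(y: float, page_height: float) -> float:
    """Convert a Y value between top-left and bottom-left origin conventions.

    The same formula works in both directions.
    """
    return page_height - y


if __name__ == "__main__":
    run = parse_text_run(ET.fromstring(SAMPLE))
    print(run.text)
    print("box width:", run.right_x - run.left_x)
    # 792.0 is an assumed page height in points, not a value from the output file.
    print("baseline with flipped origin:", flip_y(run.baseline_y, 792.0))
```

Because flip_y is its own inverse, the same helper can be used whichever page origin the output was generated with, provided the page height is known.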