I got an XHTML .hocr file from Tesseract 3.03 on Ubuntu 14.04 LTS. How can I load the data from this file into an object in Java? Or how else can I work with it? Unfortunately, I'm inexperienced with XML files, so any help would be much appreciated.
Example of the file (excerpt):
<div class='ocr_page' id='page_1' title='image "test2jpg.jpg"; bbox 0 0 10000 10000; ppageno 0'>
<div class='ocr_carea' id='block_1_1' title="bbox 250 192 8637 686">
<p class='ocr_par' dir='ltr' id='par_1_1' title="bbox 250 192 8637 686">
<span class='ocr_line' id='line_1_1' title="bbox 250 192 8637 414; baseline 0 -40">
<span class='ocrx_word' id='word_1_1' title='bbox 250 192 1606 375; x_wconf 70' lang='eng' dir='ltr'>NAME</span>
<span class='ocrx_word' id='word_1_2' title='bbox 1676 192 3051 375; x_wconf 73' lang='eng' dir='ltr'><strong>FIRSTNAME</strong></span>
The unique identifier should be "word_1_X", where X stands for a number.
The point is to get NAME and FIRSTNAME and their positions in the picture. For example:
word_1_1 has X1=250 Y1=192
X2=1606 Y2=375
string value NAME.
Any ideas on how to achieve this simply?
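To make the goal concrete, here is a minimal sketch of one possible approach using only the XML parser bundled with the JDK. It is not the only way to do this, and it rests on several assumptions: the class name `HocrWords`, the nested `Word` class and the `parse` method are hypothetical names invented for illustration; the file is assumed to be well-formed XHTML; every word is assumed to be a `span` with `class='ocrx_word'` and a `title` containing `bbox x1 y1 x2 y2`, as in the excerpt above. The DTD-related feature string is specific to the Xerces-based parser that ships with the JDK, so a different parser implementation may reject it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; the names are illustrative, not part of any library.
class HocrWords {

    /** One recognised word: its id, bounding box corners and text. */
    static final class Word {
        final String id;
        final int x1, y1, x2, y2;
        final String text;

        Word(String id, int x1, int y1, int x2, int y2, String text) {
            this.id = id;
            this.x1 = x1;
            this.y1 = y1;
            this.x2 = x2;
            this.y2 = y2;
            this.text = text;
        }
    }

    // Matches "bbox 250 192 1606 375" inside the title attribute.
    private static final Pattern BBOX =
            Pattern.compile("bbox\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)");

    /** Parses hOCR markup and returns every ocrx_word span in document order. */
    static List<Word> parse(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Assumption: JDK built-in (Xerces-based) parser. This avoids
            // downloading the XHTML DTD named in the file's DOCTYPE.
            factory.setFeature(
                    "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

            List<Word> words = new ArrayList<>();
            NodeList spans = doc.getElementsByTagName("span");
            for (int i = 0; i < spans.getLength(); i++) {
                Element span = (Element) spans.item(i);
                // Skip line-level spans (ocr_line) and anything else.
                if (!"ocrx_word".equals(span.getAttribute("class"))) {
                    continue;
                }
                Matcher m = BBOX.matcher(span.getAttribute("title"));
                if (!m.find()) {
                    continue; // no bounding box, nothing useful to record
                }
                words.add(new Word(
                        span.getAttribute("id"),
                        Integer.parseInt(m.group(1)),
                        Integer.parseInt(m.group(2)),
                        Integer.parseInt(m.group(3)),
                        Integer.parseInt(m.group(4)),
                        // getTextContent also collects text nested in <strong>, <em>, ...
                        span.getTextContent().trim()));
            }
            return words;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse hOCR input", e);
        }
    }
}
```

To use it on a real file, the content could be read into a string first, for example with `new String(Files.readAllBytes(Paths.get("out.hocr")), StandardCharsets.UTF_8)` (the file name is a placeholder), and passed to `HocrWords.parse`. If the file turns out not to be well-formed XML, a lenient HTML parser would be a better fit than the strict DOM parser used here.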