How to load an XHTML file into an object in Java and how to work with it?

Question

I got an XHTML file (.hocr) from Tesseract 3.03 on Ubuntu 14.04 LTS. How can I load the data from this file into an object in Java? Or how else can I work with it? Unfortunately, I'm inexperienced with XML files, so any help would be much appreciated.

Example of the file:

<div class='ocr_page' id='page_1' title='image "test2jpg.jpg"; bbox 0 0 10000 10000; ppageno 0'>
  <div class='ocr_carea' id='block_1_1' title="bbox 250 192 8637 686">
    <p class='ocr_par' dir='ltr' id='par_1_1' title="bbox 250 192 8637 686">
      <span class='ocr_line' id='line_1_1' title="bbox 250 192 8637 414; baseline 0 -40">
        <span class='ocrx_word' id='word_1_1' title='bbox 250 192 1606 375; x_wconf 70' lang='eng' dir='ltr'>NAME</span>
        <span class='ocrx_word' id='word_1_2' title='bbox 1676 192 3051 375; x_wconf 73' lang='eng' dir='ltr'><strong>FIRSTNAME</strong></span>

The unique identifier should be "word_1_X", where X stands for a number.

The point is to get NAME and FIRSTNAME and their positions in the picture. For example:

word_1_1 has X1=250 Y1=192, X2=1606 Y2=375, and string value NAME.

Any ideas on how to achieve this simply?

I have difficulty understanding the question, but in any case use an XML parser of your choice, such as JAXB (included as part of the Java SE 6 API) or JAXP. There is also an HTML parser, Jsoup, if you need it. — Tiny, May 05 '15 at 10:49
possible duplicate of [Java: How to read and write xml files?](http://stackoverflow.com/questions/7373567/java-how-to-read-and-write-xml-files) — Joe, May 05 '15 at 11:28

score 1 · Accepted Answer · edited May 23 '17 at 12:14

You normally use an XML parser to parse XML files.

But as it appears to actually be an HTML file (most likely just the HTML output produced by an XHTML file as part of a JSF web application), you'd better use an HTML parser.

There are many HTML parsers; one of the most suitable for parsing real-world HTML files and extracting specific data is Jsoup.

Provided that the HTML output is available at the URL http://example.com/some/page.jsf, here's how you could use Jsoup to extract the data of interest:

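import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Fetch and parse the page, then visit every element with class "ocrx_word".
Document document = Jsoup.connect("http://example.com/some/page.jsf").get();

for (Element ocrxWord : document.select(".ocrx_word")) {
    String text = ocrxWord.text(); // NAME, FIRSTNAME, etc
    String title = ocrxWord.attr("title"); // bbox 250 192 1606 375; x_wconf 70, etc
    // ...
}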

Once you have the title, it is just a matter of using basic java.lang.String methods to break it down further into smaller parts. That responsibility is beyond the scope of the HTML parser; any Java beginner should be able to figure it out on their own.
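As a rough sketch of that last step, the following uses only String.split and Integer.parseInt to pull the four bbox numbers out of a title value. The class name HocrTitle and the method parseBbox are made up for illustration. The sketch assumes that the properties in the title are separated by semicolons and that bbox is followed by exactly four integers, as in the sample file from the question.

```java
import java.util.Arrays;

/** Hypothetical helper (not part of Jsoup): reads the bbox out of an hOCR title attribute. */
final class HocrTitle {

    /**
     * Returns the coordinates {x1, y1, x2, y2} from a title such as
     * "bbox 250 192 1606 375; x_wconf 70".
     * Assumption: properties are separated by ';' and bbox has four integers.
     */
    static int[] parseBbox(String title) {
        for (String property : title.split(";")) {
            // Split one property into whitespace-separated tokens.
            String[] tokens = property.trim().split("\\s+");
            if (tokens.length == 5 && tokens[0].equals("bbox")) {
                int[] box = new int[4];
                for (int i = 0; i < 4; i++) {
                    box[i] = Integer.parseInt(tokens[i + 1]);
                }
                return box;
            }
        }
        throw new IllegalArgumentException("No bbox in title: " + title);
    }

    public static void main(String[] args) {
        int[] box = parseBbox("bbox 250 192 1606 375; x_wconf 70");
        System.out.println(Arrays.toString(box)); // prints [250, 192, 1606, 375]
    }
}
```

Inside the Jsoup loop above, this could be called as HocrTitle.parseBbox(title), with the element's id attribute (word_1_1 and so on) serving as the key for each word.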

Yeah, thank you for your help. I haven't done anything like this yet, so I'm a bit clueless here, plus there's the language barrier. I will try it out later today and mark it as solved if it works. — Candybrk, May 05 '15 at 13:50
