
I'm trying to get data from a dictionary (this one: http://vk.com/doc8069473_312422685?hash=78fd2d459ed8547b29&dl=86147ab2323652f43d). I use PDFBox to extract the text from this PDF file.

To do that, I created a class "Article" to store each word, its type (adj, noun, etc.), all its definitions and all its examples.

I use regular expressions to find the beginning and the end of each article.

Here is the pattern I use (PHNTC is added by me to replace phonetic notations):

Pattern pattern = Pattern.compile("(((\\w|\\–|\\-|&|,|’|/|â|é|è|ê|à|ô| )*)(\\s)+(PHNTC( )+)?(abbr|adj|adv|article|conj|interj|modal verb|noun|plural noun|prefix|prep|pron|phrase|suffix|(?<!((forming|making part of) a ))verb|expr)(, (abbr|adj|adv|article|conj|interj|modal verb|noun|plural noun|prefix|prep|pron|phrase|suffix|(?<!((forming|making part of) a ))verb|expr)\\s)?[^a-z]|((\\w|\\–|\\-|&|,|’|/|â|é|è|ê|à|ô| )*)(\\s)+(PHNTC( )+))");

As you can see, it is quite complicated, and although it works for more than 99% of the articles (about 100 "wrong" articles out of 29,000), I still have some problems. For example, if "noun" is written somewhere in a definition, my program might think it is the beginning of a new article! You can see in the pattern above my attempts to resolve some ambiguities with "verb" (the negative lookbehinds).

I think the only way to solve those problems would be to put markup around bold and italic text. I would like to use something like this:

Pattern pattern = Pattern.compile("<b>.*</b>(\\s)+(PHNTC( )+)?<i>.*</i>(, <i>.*</i>)?");
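One thing I noticed while writing the tagged pattern above: `.*` is greedy, so `<b>.*</b>` can swallow several bold runs on the same line. Here is a minimal sketch of a stricter variant, assuming the headword and the type never contain a `<` character. The class name `TaggedHeadwordDemo`, the method `firstHeadword` and the sample strings are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TaggedHeadwordDemo {
    // [^<]+ cannot run past a closing tag, unlike .* (greedy) or .*? (which can
    // still stretch across tags when the rest of the pattern fails to match).
    static final Pattern HEADWORD = Pattern.compile(
            "<b>([^<]+)</b>\\s+(?:PHNTC +)?<i>([^<]+)</i>(?:, <i>([^<]+)</i>)?");

    /**
     * Returns {word, type, secondType} for the first article start found in
     * the text; secondType is null when absent. Returns null if nothing matches.
     */
    static String[] firstHeadword(String text) {
        Matcher m = HEADWORD.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        // Hypothetical tagged line, not taken from the real dictionary.
        String sample = "<b>run</b> PHNTC <i>verb</i>, <i>noun</i> to move fast";
        String[] h = firstHeadword(sample);
        System.out.println(h[0] + " / " + h[1] + " / " + h[2]);
    }
}
```

With this form, a bold word inside a definition that is not followed by an italic type is simply skipped, and a plain "noun" inside a definition no longer looks like a new article.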

And now, here is my problem: how can I insert that markup using PDFBox?

I found a question (How to extract bold text from pdf using pdfbox?) about extracting bold text by overriding the method processTextPosition( TextPosition text ) of PDFTextStripper.

I tried it, but:

1) I failed to find any bold text

2) I don't want to extract only the bold text, I still want to extract everything!
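To make point 2 concrete, here is a minimal sketch of the tagging step I have in mind, written without PDFBox so it can be run on its own. It assumes the extraction can give me each piece of text together with the name of the font it was drawn in (inside a PDFTextStripper subclass that would presumably come from each TextPosition's font), and that bold and italic fonts can be recognised from their names. Both are assumptions: a PDF can also fake bold without changing the font, which may be why point 1 failed. The names `StyleTagger`, `Run` and `tag` are made up.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class StyleTagger {
    /** One extracted piece of text plus the name of the font it was drawn with. */
    static final class Run {
        final String text;
        final String fontName;

        Run(String text, String fontName) {
            this.text = text;
            this.fontName = fontName;
        }
    }

    // Heuristic only: relies on the font name, e.g. "ABCDEF+Helvetica-Bold".
    static boolean isBold(String fontName) {
        return fontName != null
                && fontName.toLowerCase(Locale.ROOT).contains("bold");
    }

    static boolean isItalic(String fontName) {
        if (fontName == null) {
            return false;
        }
        String n = fontName.toLowerCase(Locale.ROOT);
        return n.contains("italic") || n.contains("oblique");
    }

    /** Keeps all the text and wraps bold runs in <b> and italic runs in <i>. */
    static String tag(List<Run> runs) {
        StringBuilder out = new StringBuilder();
        boolean bold = false;
        boolean italic = false;
        for (Run r : runs) {
            boolean b = isBold(r.fontName);
            boolean i = isItalic(r.fontName);
            if (b != bold || i != italic) {
                // Close the old style (inner tag first), then open the new one.
                if (italic) out.append("</i>");
                if (bold) out.append("</b>");
                if (b) out.append("<b>");
                if (i) out.append("<i>");
                bold = b;
                italic = i;
            }
            out.append(r.text);
        }
        // Close whatever is still open at the end of the text.
        if (italic) out.append("</i>");
        if (bold) out.append("</b>");
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical runs and font names, not taken from the real PDF.
        List<Run> runs = Arrays.asList(
                new Run("abandon", "ABCDEF+Helvetica-Bold"),
                new Run(" ", "Helvetica"),
                new Run("verb", "Helvetica-Oblique"),
                new Run(" to leave", "Helvetica"));
        System.out.println(tag(runs));
    }
}
```

What I am missing is the PDFBox side: where to hook this in so that every piece of text still reaches the output, and how to decide reliably that a piece of text is bold or italic.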

Any ideas?

  • I think you meant "markup" when you wrote "balise", so you may want to edit the question. (see https://fr.wikipedia.org/wiki/Langage_de_balisage ). Re your question, I suspect that the answer in the other question you link to does mention the problem you're having, i.e. the difficulties to identify what is bold. (There is no concept of "bold" in PDF) – Tilman Hausherr Jul 13 '15 at 06:03
  • Thank you for your answer, I edited my question (I thought the word existed in English). Indeed it is mentioned, but the idea in the other question is to identify bold text and to run the extraction only for that text. This is not really what I want to do: I would like to surround bold text with markup during the extraction so that the final result contains <b>...</b> and <i>...</i>. Moreover, I tried to identify bold text with the tricks used in the question I linked to, and nothing worked for me. – Guillaume COTER Jul 13 '15 at 07:15
  • @GuillaumeCOTER The code from [this answer](http://stackoverflow.com/a/25290318/1729265) may help. It shows how to check for font changes and insert hints to that effect. – mkl Jul 15 '15 at 22:15
  • Thank you very much for your answer, I will take a look and try. – Guillaume COTER Jul 16 '15 at 19:59
