I'm trying to extract data from a dictionary
(this one: http://vk.com/doc8069473_312422685?hash=78fd2d459ed8547b29&dl=86147ab2323652f43d). I use PDFBox
to extract the text from this PDF file.
To do that, I created a class "Article" that stores each word, its type (adj, noun, etc.), all its definitions and all its examples.
I use regular expressions to find the beginning and the end of each article.
Here is the pattern I use (PHNTC is a placeholder I insert myself in place of the phonetic notation):
Pattern pattern = Pattern.compile("(((\\w|\\–|\\-|&|,|’|/|â|é|è|ê|à|ô| )*)(\\s)+(PHNTC( )+)?(abbr|adj|adv|article|conj|interj|modal verb|noun|plural noun|prefix|prep|pron|phrase|suffix|(?<!((forming|making part of) a ))verb|expr)(, (abbr|adj|adv|article|conj|interj|modal verb|noun|plural noun|prefix|prep|pron|phrase|suffix|(?<!((forming|making part of) a ))verb|expr)\\s)?[^a-z]|((\\w|\\–|\\-|&|,|’|/|â|é|è|ê|à|ô| )*)(\\s)+(PHNTC( )+))");
As you can see, it is quite complicated. It is sufficient for about 99% of the articles (I have about 100 "wrong" articles out of 29,000), but I still have some problems. For example, if "noun" appears somewhere in a definition, my program might take it for the beginning of a new article! The negative lookbehinds in the code above are my attempts to resolve some of these ambiguities for "verb".
I think the only way to solve these problems is to put markup tags around bold and italic text. I would like to use something like this:
Pattern pattern = Pattern.compile("<b>.*</b>(\\s)+(PHNTC( )+)?<i>.*</i>(, <i>.*</i>)?");
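To make the idea concrete, here is a minimal sketch of how I would apply such a pattern, assuming the extracted text already contained <b> and <i> tags (it does not yet; the sample string is made up by me, and the class and method names are only for illustration). I used reluctant quantifiers (.*?) in the sketch, since a greedy .* could run on to a later closing tag on the same line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TaggedHeaderDemo {
    // Reluctant quantifiers keep each group inside a single tag pair.
    // Groups: 1 = headword (bold), 2 = type (italic), 3 = optional second type.
    static final Pattern HEADER = Pattern.compile(
        "<b>(.*?)</b>\\s+(?:PHNTC +)?<i>(.*?)</i>(?:, <i>(.*?)</i>)?");

    // Returns one {word, type, secondTypeOrNull} array per article header found.
    static List<String[]> findHeaders(String taggedText) {
        List<String[]> headers = new ArrayList<>();
        Matcher m = HEADER.matcher(taggedText);
        while (m.find()) {
            headers.add(new String[] { m.group(1), m.group(2), m.group(3) });
        }
        return headers;
    }

    public static void main(String[] args) {
        // Hypothetical tagged text: an assumption about what the extractor could emit.
        String sample =
            "<b>abandon</b> PHNTC <i>verb</i> to leave someone or something\n"
            + "<b>able</b> PHNTC <i>adj</i> a noun such as 'ability' comes from it\n"
            + "<b>about</b> <i>prep</i>, <i>adv</i> on the subject of";
        for (String[] h : findHeaders(sample)) {
            System.out.println(h[0] + " | " + h[1] + " | " + h[2]);
        }
    }
}
```

With tags in place, the word "noun" inside a definition would no longer be mistaken for an article header, because only italic text directly after a bold headword is matched.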
Now, here is my problem: how can I add this markup using PDFBox?
I found a question (How to extract bold text from pdf using pdfbox?) about extracting bold text by overriding the method processTextPosition(TextPosition text) from PDFTextStripper.
I tried it, but:
1) I failed to detect the bold text
2) I don't want to extract only the bold text; I still want to extract everything!
Any ideas?