I recently tried to use PDFBox to extract text from a PDF file. It works fine for most PDFs, but for one PDF (which unfortunately I am not permitted to share), none of the periods in the sentences are extracted. Instead, I get phrases like the following:

...what it would be It’ll be important later on...

It looks as though each period-space has become just a space, but that's not the case (at least on Mac OS X). If you copy the text into a text editor and move the text cursor through the phrase, there is an "empty character" right after the first "be". To reproduce:

  • Place the cursor right before the "e" in the first "be" and press the right arrow key. The cursor moves one step to the right.
  • Press the right arrow key again; the cursor stays where it is.
  • Press the right arrow key one more time; the cursor moves to the other side of the space.
  • Continuing to press the right arrow key behaves as expected.

It appears that PDFBox extracted some sort of "empty character" in place of a period. I've tried to replace it a few different ways but have had no luck:

String oldText = text;
text = text.replace('\u0000', '.'); // Unicode null
text = text.replace('\0', '.'); // C-style null (the same character as above)
System.out.println(oldText.equals(text)); // prints true, so nothing was replaced
// Also tried text.replace(null, '.'), but it doesn't compile

What is this "empty character" and how can I replace it with the text that is supposed to be there?

EDIT: This answer suggested that it might be a character such as \uFEFF, but trying to replace it with a regex as suggested did not work.

Thunderforge
  • The fact that you can't share the document (and felt compelled to say so) makes the excerpt, "places his feet Either way" somewhat creepy :-P –  Mar 26 '13 at 23:41
  • Haha, yeah! Perhaps that wasn't the best example to share because it was out of context. I've changed the example. – Thunderforge Mar 26 '13 at 23:46

1 Answer

After realizing that the character wasn't \uFEFF or \u0000, two Unicode values that other Stack Overflow users had run into, I ran a test to find out what the character code actually was. Using the code in this answer to determine the Unicode value, I found that the mysterious character was \u0008, which is "backspace". Why that would have been pulled from the PDF, I don't know, but text = text.replace('\u0008', '.') now replaces it with the missing periods.
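For anyone hitting something similar, here is a minimal sketch of that kind of test. It is not the code from the linked answer. The class and method names (ControlCharInspector, findControlChars, restorePeriods) are made up for illustration, and the sample string is only a stand-in for what PDFBox returned. It lists any control characters hiding in the text and then swaps backspace for a period, which assumes every backspace in the extracted text really stands for a period.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper class, named for illustration only.
class ControlCharInspector {

    // Collects each distinct control character in the text, formatted as U+XXXX.
    static Set<String> findControlChars(String text) {
        Set<String> found = new LinkedHashSet<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // Skip the ordinary whitespace controls so only the odd ones are reported.
            if (Character.isISOControl(c) && c != '\n' && c != '\r' && c != '\t') {
                found.add(String.format("U+%04X", (int) c));
            }
        }
        return found;
    }

    // Replaces every backspace character (U+0008) with a period.
    // Assumption: in this document, each backspace stands for a missing period.
    static String restorePeriods(String text) {
        return text.replace('\b', '.');
    }

    public static void main(String[] args) {
        // Stand-in for the extracted text: a backspace sits where the period should be.
        String extracted = "what it would be\b It'll be important later on";
        System.out.println(findControlChars(extracted));
        System.out.println(restorePeriods(extracted));
    }
}
```

With the sample string, the first line printed is [U+0008] and the second is the sentence with its period restored. If the report shows a different code point for your PDF, the replace call would need that character instead.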

Thunderforge