An "Empty" Character Extracted from a PDF

Question

I recently tried to use PDFBox to extract text from a PDF file. It works fine for most PDFs, but for one PDF (which unfortunately I am not permitted to share), none of the periods at the ends of sentences are extracted. Instead, I get phrases like the following:

...what it would be It’ll be important later on...

It looks as if each period-space has been replaced by a plain space, but that is not the case (at least on Mac OS X). If you copy the text into a text editor and move the text cursor through the phrase, there is an "empty character" right after the "t" in "feet". To reproduce:

Place the cursor right before the letter "t" in "feet" and press the right arrow key. The cursor moves one step to the right.
Press the right arrow key again. The cursor stays right where it is.
Press the right arrow key one more time. The cursor moves to the other side of the space.
Continuing to press the right arrow key behaves as expected.

It appears that PDFBox extracted some sort of "empty character" in place of a period. I've tried to replace it a few different ways but have had no luck:

String oldText = text;
text = text.replace('\u0000', '.'); // Unicode null
text = text.replace('\0', '.'); // Octal escape for the same null character
System.out.println(oldText.equals(text)); // Prints true, so nothing was replaced
// Also tried text.replace(null, '.'), but it doesn't compile

What is this "empty character" and how can I replace it with the text that is supposed to be there?

EDIT: This answer suggested that the character might be a character such as \uFEFF, but trying to replace it with a regex as suggested did not work.

The fact that you can't share the document (and felt compelled to say so) makes the excerpt, "places his feet Either way" somewhat creepy :-P — , Mar 26 '13 at 23:41
Haha, yeah! Perhaps that wasn't the best example to share because it was out of context. I've changed the example. — Thunderforge, Mar 26 '13 at 23:46

score 2 · Accepted Answer · edited May 23 '17 at 11:49

After realizing that the character was neither \uFEFF nor \u0000, the two Unicode values that other Stack Overflow users had run into, I ran a test to find out what the character actually was. Using the code in this answer to print its Unicode value, I found that the mysterious character was \u0008, which is "backspace". Why that would have been pulled from the PDF, I don't know, but text = text.replace('\u0008', '.') now replaces it with the missing periods.
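The code from the linked answer is not reproduced here, so the following is only a minimal sketch of the same idea: walk through the extracted string, report the code point of every control or format character, and then swap the backspace for a period. The class name ControlCharInspector and the methods findInvisible and restorePeriods are hypothetical names chosen for illustration, and the sample string is made up to mimic the phrase in the question. Treating every \u0008 as a lost period is an assumption that happened to hold for this one PDF.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; none of these names come from PDFBox.
final class ControlCharInspector {

    // Returns "index:U+XXXX" for every control (Cc) or format (Cf) character.
    // Tabs and line breaks are control characters too, so they are reported as well.
    public static List<String> findInvisible(String text) {
        List<String> found = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            int type = Character.getType(cp);
            if (type == Character.CONTROL || type == Character.FORMAT) {
                found.add(i + ":" + String.format("U+%04X", cp));
            }
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return found;
    }

    // Assumption: in this document every backspace stands in for a period.
    public static String restorePeriods(String text) {
        return text.replace('\u0008', '.');
    }

    public static void main(String[] args) {
        // Made-up sample; "\b" is the Java escape for U+0008.
        String extracted = "what it would be\b It'll be important later on";
        System.out.println(findInvisible(extracted));
        System.out.println(restorePeriods(extracted));
    }
}
```

Checking the Unicode category instead of a fixed list of suspects means the same loop would also have flagged \uFEFF or \u0000 if either had been the culprit.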
