Parsing a PDF returns the same text twice on different pages

Question

I have a PDF file that contains 2 pages. When I parse it with my parser, written in Objective-C, I run into the following situation.

For the first page everything is OK: I get the text I should get (the text I see in PDF readers like Preview, Adobe Reader, ...). For the second page I get the text that I see on the second page PLUS part of the text from the first page, which is not visible on the second page.

I tried other parsers. pdftotext (xpdf) produced the correct result. With pdfminer (in Python, https://pypi.python.org/pypi/pdfminer/) I got the same result as with my own parser: part of the text from the first page is extracted twice.

My question is: how can this happen? Have you ever seen this situation? If the text is really present on the second page, why don't PDF readers show it? Do you have any thoughts about this?

Maybe the text is there on the page, but not visible because of: something called the "Crop Box", something called "OCG",... Maybe the text is white on white. Without seeing the actual PDF, one can only guess, but there are many possible reasons. — Bruno Lowagie, Jun 03 '13 at 12:11
I tried to open the file in Illustrator and in Acrobat Pro, I saw nothing. I also tried to select text in second page, nothing more than the text that we can see. Thanks for sharing your thoughts. More ideas are welcome. — bob, Jun 03 '13 at 12:15
I had a look in the PDF references, there is no OCG in my pdf since it's a 1.4 pdf and optional contents only begin in v. 1.5. I verified my document catalog dictionary and I don't have OCProperties entry. — bob, Jun 03 '13 at 13:01
I'll ask for permission and I'll get back to you. Thank you for your help. — bob, Jun 03 '13 at 16:42
@BrunoLowagie Here is the [file](http://www.sendspace.com/file/mzjfif) — bob, Jun 04 '13 at 09:37
The contents of both document pages are very similar. The only exceptions are a) that the first page has a "30 BEAUTÉ ANIMALE" while the second one has a "31 PRÉJUGÉS" and b) that the remaining text is shifted aside beyond the media box. If you want to extract only what is visible, you need to filter by area. — mkl, Jun 04 '13 at 16:12

score 2 · Accepted Answer · answered Jun 04 '13 at 17:52

I ran your file through Acrobat (using "Examine Document") and it tells me there's some hidden text in it. Take a look at the following screenshot:

[Screenshot: Acrobat "Examine Document" result, with the hidden text marked in red]

The text in red in the screenshot marks what is hidden. As mkl indicates, it is positioned OUTSIDE the MediaBox, which makes it invisible when you look at the document in a PDF viewer. That doesn't mean the text isn't there. If you look inside the content stream (which is what parsers do), you'll still find it.

Your parser should discard everything that is outside the MediaBox. Normally there's an option to do that. I know there is one in iText; I don't know about other parsers.
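
If your parser has no such option, you can apply the filter yourself once you have a bounding box for each piece of extracted text. The following is only a minimal sketch in plain Python, not the API of pdfminer, iText, or any other library: `TextRun`, `intersects` and `visible_runs` are hypothetical names, the coordinates are made-up illustrative values (not measured from the file in the question), and it assumes the boxes are already expressed in the same coordinate space as the MediaBox, with no page rotation and no separate CropBox to consider.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

# (x0, y0, x1, y1): lower-left and upper-right corners, in PDF points.
Box = Tuple[float, float, float, float]


@dataclass
class TextRun:
    """Hypothetical record: one extracted piece of text plus its bounding box."""
    text: str
    bbox: Box


def intersects(a: Box, b: Box) -> bool:
    """Return True if rectangles a and b overlap with non-zero area."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1


def visible_runs(runs: Iterable[TextRun], media_box: Box) -> List[TextRun]:
    """Keep only the runs whose bounding box overlaps the page's MediaBox."""
    return [run for run in runs if intersects(run.bbox, media_box)]


# Illustrative values only (assumption): an A4-sized MediaBox and two runs,
# one inside the page and one shifted off to the right of it.
media_box = (0.0, 0.0, 595.0, 842.0)
page_two = [
    TextRun("visible heading", (50.0, 780.0, 200.0, 800.0)),
    TextRun("text shifted off the page", (650.0, 780.0, 820.0, 800.0)),
]

for run in visible_runs(page_two, media_box):
    print(run.text)
```

This version keeps a run as soon as any part of it overlaps the page. If you would rather drop text that is only partially inside, test for full containment instead of intersection.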
