Why does pdftotext squash words together sometimes?

Asked Sep 09 '19 at 09:28

Active Sep 09 '19 at 09:28

Viewed 62 times

I am trying to convert some PDFs to text using pdftotext. The conversion works, but some words get squashed together: for example, "the 2nd day" becomes "the2nd day", "before me" becomes "beforeme", and so on. Why does this happen, and how can I get rid of these discrepancies?

I have also tried using Okular (since I use Linux) to convert the PDF to text, but it gives the same kind of output. This is a real problem because it badly hinders text extraction.

asked Sep 09 '19 at 09:28

anushka

Relevant https://stackoverflow.com/a/11087993/5320906 – snakecharmerb Sep 09 '19 at 09:37
@snakecharmerb Thanks for pointing me there. I now understand that a space in a PDF can be non-printable, in which case it doesn't get selected in a PDF reader. But how can I find these places and insert spaces so that the words don't get squashed together? – anushka Sep 09 '19 at 09:45
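
To make the idea in the comments concrete: when a PDF stores no space character, an extractor can only guess word boundaries from the horizontal gaps between glyphs. Below is a minimal sketch of that heuristic, not pdftotext's actual algorithm. The `Glyph` tuple, the `join_glyphs` function, the `gap_ratio` parameter, and the coordinates are all hypothetical, made up for illustration.

```python
from typing import List, Tuple

# Hypothetical glyph record: (character, x_start, x_end) on a single text line.
Glyph = Tuple[str, float, float]


def join_glyphs(glyphs: List[Glyph], gap_ratio: float = 0.25) -> str:
    """Rebuild a line of text, inserting a space wherever the horizontal gap
    between neighbouring glyphs exceeds gap_ratio * average glyph width."""
    if not glyphs:
        return ""
    # Work left to right, regardless of the order the glyphs were stored in.
    glyphs = sorted(glyphs, key=lambda g: g[1])
    avg_width = sum(x1 - x0 for _, x0, x1 in glyphs) / len(glyphs)
    out = [glyphs[0][0]]
    for prev, cur in zip(glyphs, glyphs[1:]):
        gap = cur[1] - prev[2]  # distance from end of previous glyph to start of this one
        if gap > gap_ratio * avg_width:
            out.append(" ")
        out.append(cur[0])
    return "".join(out)


# Assumed positions: every glyph is 5 units wide, with a 1.5-unit gap before "2".
line = [
    ("t", 0.0, 5.0), ("h", 5.0, 10.0), ("e", 10.0, 15.0),
    ("2", 16.5, 21.5), ("n", 21.5, 26.5), ("d", 26.5, 31.5),
]

print(join_glyphs(line, gap_ratio=0.25))  # gap 1.5 > 1.25, so a space is inserted
print(join_glyphs(line, gap_ratio=0.5))   # gap 1.5 <= 2.5, so the words are squashed
```

With a tight threshold the narrow gap is treated as a word break; with a looser one the same glyphs come out as "the2nd". If the PDF is typeset with very small gaps, an extractor with a fixed threshold will merge words in the same way. It may be worth trying pdftotext's `-layout` option or a library that exposes this threshold (for example the `word_margin` setting in pdfminer.six's `LAParams`), though whether either helps depends on the particular PDF.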


0 Answers