PDF conversion with poppler-utils: Is there a way to avoid decoding difficulties?

Asked Apr 13 '23 at 20:07

I am converting PDF files to text using the pdftotext tool from poppler-utils on Ubuntu. Unfortunately, I keep running into a problem where some files are not converted correctly.

A correctly converted file looks like this:

  82 => '23:00 23:00 - 05:00 05:00 01:30',
  83 => 'Page 1 of 5',
  84 => 'Generated on Feb 05, 2023 17:11',

But some files result in something like this:

  82 => 'WĂƌƚŝĂůK&&;ĞŶĐƌŽĂĐŚĞĚďǇ',
  83 => 'ĚƵƚǇͿ',
  84 => 'ϬϬ͗ϭϯͲϮϯ͗ϱϵ D',

Both documents are PDF version 1.4 and appear to have been produced by the same software, so I'm at a loss as to what is causing this problem.

Does anyone have a suggestion for what to try next?
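
For reference, here is a minimal Python sketch of how the badly converted files could at least be flagged automatically. It is only a heuristic, not a fix: it assumes the documents are mostly plain ASCII text (as in the good sample above) and treats output with a high share of non-ASCII characters as suspect. The function names and the 0.3 threshold are hypothetical choices for illustration. `pdf_to_text` assumes `pdftotext` is on the PATH and passes `-` as the output file so the text goes to stdout.

```python
import subprocess


def non_ascii_ratio(text: str) -> float:
    """Share of non-whitespace characters that fall outside plain ASCII."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(ord(c) > 127 for c in chars) / len(chars)


def looks_garbled(text: str, threshold: float = 0.3) -> bool:
    # The threshold is an arbitrary assumption; tune it on known-good files.
    return non_ascii_ratio(text) > threshold


def pdf_to_text(path: str) -> str:
    """Run pdftotext on one file and return the extracted text."""
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


# The two sample lines from the outputs shown above.
good = "23:00 23:00 - 05:00 05:00 01:30"
bad = "WĂƌƚŝĂůK&&;ĞŶĐƌŽĂĐŚĞĚďǇ"
print(looks_garbled(good), looks_garbled(bad))  # False True
```

A check like `looks_garbled(pdf_to_text("some.pdf"))` would only separate the two kinds of files; it would misfire on documents that legitimately contain a lot of non-ASCII text.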

asked Apr 13 '23 at 20:07

lowflyer7

Thanks, that helped. The reading is garbled too and I am not able to see any logic. – lowflyer7 Apr 17 '23 at 13:18

If I acquire the file through another device/browser combination, it works fine. I suspect it has something to do with the browser version/configuration?! – lowflyer7 Apr 17 '23 at 13:37

0 Answers