I have used ghostscript to successfully extract text from PDFs that have tables.

This simple command works very well:

gswin64c -sDEVICE=txtwrite -o test.txt "c:\reports\sample.pdf"

However, some words get joined together, especially in tables. For example:

  234801111111109-12-2014 16:17:04764030208117034 2883253100.00  Payment
  234801111111109-12-2014 16:18:461088956908117033 2883253400.00 Payment
  234801111111109-12-2014 16:19:48769948208117040 2883253750.00  Payment

should actually be:

  2348011111111 09-12-2014 16:17:04 764030208117034 2883253 100.00  Payment
  2348011111111 09-12-2014 16:18:46 1088956908117033 2883253 400.00 Payment
  2348011111111 09-12-2014 16:19:48 769948208117040 2883253 750.00  Payment

Is there a way to add a separator character at the end of each word?

That would solve this perfectly.

Charles Okwuagwu

1 Answer

No, sorry, this idea simply won't work.

There is no such thing as a 'word' in a PDF file; there is simply a sequence of character codes and positions. The txtwrite code goes to some lengths to try to reconstruct words by looking at the position of each piece of text and the metrics of the fonts used, but there are no words in the original.

I don't claim this is perfect. If you'd like me to look at it, you will need to supply the original file. The best solution is to open a bug report and attach the file to it.

This is still an area I'm looking at for a different project (RTF output), so now is a good time to report it. I cannot guarantee being able to resolve it, but it may well be that the 'rebuild the page layout' code is simply being too simple-minded about the location of the text.

You can, however, get a lower-level output: the XML-like output gives you each fragment of text individually, along with its position on the page. You could use that information yourself to rebuild the content.
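As a rough illustration of rebuilding the content from positions, here is a minimal Python sketch. It assumes you have already parsed the XML-like output (as far as I know, selected with `-dTextFormat=0` on the txtwrite device) into tuples of `(text, x_left, x_right, y_baseline)`. That tuple layout, the function name `rebuild_lines`, and the `gap` and `y_tol` thresholds are all hypothetical choices for this example, not anything Ghostscript defines, and the coordinates in the usage lines are made up.

```python
def rebuild_lines(fragments, sep="|", gap=0.5, y_tol=2.0):
    """Group fragments into rows by baseline, then join each row left to
    right, inserting `sep` wherever the horizontal gap exceeds `gap`.

    fragments: iterable of (text, x_left, x_right, y_baseline) tuples
    (a hypothetical layout; adapt it to whatever your parser produces).
    """
    rows = []  # list of (row_y, [fragments in that row])
    # PDF y coordinates usually grow upward, so larger y is nearer the top.
    for frag in sorted(fragments, key=lambda f: -f[3]):
        if rows and abs(rows[-1][0] - frag[3]) <= y_tol:
            rows[-1][1].append(frag)
        else:
            rows.append((frag[3], [frag]))

    lines = []
    for _, frags in rows:
        frags.sort(key=lambda f: f[1])  # left to right
        out = frags[0][0]
        for prev, cur in zip(frags, frags[1:]):
            # Touching fragments are glued; separated ones get a marker.
            out += (sep if cur[1] - prev[2] > gap else "") + cur[0]
        lines.append(out)
    return lines


# Made-up coordinates, for illustration only.
fragments = [
    ("09-12-2014", 81.0, 130.0, 700.0),
    ("2348011111111", 10.0, 80.0, 700.0),
    ("Pay", 300.0, 315.0, 700.0),
    ("ment", 315.0, 335.0, 700.5),
    ("second", 10.0, 40.0, 680.0),
]
print(rebuild_lines(fragments))
```

The thresholds would need tuning per document. A small `gap` separates table columns that sit close together, at the risk of splitting a word that the PDF emits as several fragments with slight spacing between them.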

The default option tries to build a simple representation of the page by using space characters to reproduce the layout of the original, as far as possible, but I have no illusions that there aren't bugs :-)

KenS
  • Can we access and test the new rtfwrite device, please? – Charles Okwuagwu Feb 27 '18 at 18:08
  • Not currently; it's on a branch in my private repository because it's nowhere near finished. Assuming I ever get it finished, it will be included in the master source code. – KenS Feb 27 '18 at 19:32