I need to parse a four-page PDF file that contains train timetables.

example

The problem with PDFBox: empty table cells are dropped from the extracted text! :-(

Is there any way to make PDFBox treat an empty table cell as a special character or sequence?

Let's take an example:

-> station "Thann (A)"

-> I want to keep a time only if the corresponding "Thann (D)" cell is not empty, so I wouldn't keep 07.01!

-> How could I do this?

For now my app works: I read the 4 pages of the PDF and analyze the buffered text with a custom Java class to get the data I need.

(I do it this way because on Android there is a memory crash when I read the PDF twice or more, even though it works well in a standard Java project!)

But this way, I end up with a few times that I don't need, because the cell for the next station is empty.

I would like to get for "Thann (A)":

06.01|06.30|06.21|07.01|(empty)|07.30

06.02|06.32|06.22|(empty)|07.03|07.33

AND NOT:

06.01|06.30|06.21|07.01|07.30

06.02|06.32|06.22|07.03|07.33

  • How do you currently go about parsing? That is, what methods do you use for reading specific cells and how do you use them and their return values to obtain your negative example? – Jan D Sep 07 '15 at 13:59
  • [*This answer*](http://stackoverflow.com/a/28370692/1729265) may be helpful. It extends the PDFBox `PDFTextStripper` to return text lines which attempt to reflect the PDF file layout, adding space characters to represent gaps like your empty cells. You may have to find your own optimum `fixedCharWidth` value... – mkl Sep 07 '15 at 15:31
  • Jan: I read the whole file (`String Text = pdfStripper.getText(pdDoc);`), with no cell-specific handling. – Steph68 Sep 07 '15 at 16:29
  • Mkl: ok, will have a look, thanks ;-) – Steph68 Sep 07 '15 at 16:30
  • OK, it works well, I now have my empty columns in my String buffer, thanks! BUT how do I now tell the difference, within the String buffer, between a space character that stands for an empty column and a real space character? In the class `LayoutTextStripper`, couldn't I replace the spaces that stand for an empty column with a special character or sequence, for instance "ß"? Where would I do that? – Steph68 Sep 08 '15 at 12:07
  • @mkl Your link is dead – beldaz Oct 11 '17 at 01:02
  • @beldaz Yes, the OP of that question had deleted his question and my answer with it. But I've meanwhile copied the answer [here](https://stackoverflow.com/a/45842515/1729265). – mkl Oct 11 '17 at 04:38
  • Thanks @mkl - Really annoying when a good answer gets nuked by OP deleting a question. – beldaz Oct 11 '17 at 05:16
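
One possible way to tell "empty column" spaces from real spaces, once the extracted lines preserve the layout, is to stop looking at individual space characters and slice each line at fixed column offsets instead. The following is only a minimal sketch under assumptions: the class `TimetableRows`, its methods, and the column offsets are hypothetical and not part of PDFBox, and it assumes every column of the timetable starts at the same character offset on every line. A cell that contains only spaces is reported with a marker, and a second helper keeps a time only when the cell for the next station is filled.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for fixed-width timetable lines; not part of PDFBox. */
final class TimetableRows {
    /** Marker used for a blank cell; any special sequence (e.g. "ß") would do. */
    static final String EMPTY = "(empty)";

    /**
     * Slices a layout-preserving text line at the given column start offsets
     * (assumed ascending). A cell that holds only spaces becomes EMPTY.
     */
    static List<String> splitFixedWidth(String line, int[] columnStarts) {
        List<String> cells = new ArrayList<>();
        for (int i = 0; i < columnStarts.length; i++) {
            // Clamp both bounds so short lines do not throw.
            int start = Math.min(columnStarts[i], line.length());
            int end = (i + 1 < columnStarts.length)
                    ? Math.min(columnStarts[i + 1], line.length())
                    : line.length();
            String cell = line.substring(start, end).trim();
            cells.add(cell.isEmpty() ? EMPTY : cell);
        }
        return cells;
    }

    /**
     * Keeps a time from the current row only if both it and the cell in the
     * same column of the next row are filled.
     */
    static List<String> keepWhereNextPresent(List<String> current, List<String> next) {
        List<String> kept = new ArrayList<>();
        int n = Math.min(current.size(), next.size());
        for (int i = 0; i < n; i++) {
            if (!EMPTY.equals(current.get(i)) && !EMPTY.equals(next.get(i))) {
                kept.add(current.get(i));
            }
        }
        return kept;
    }
}
```

The offsets would have to be measured on the actual extracted text, and they are likely to shift if the `fixedCharWidth` value is changed. With suitable offsets, `splitFixedWidth` would be called once per extracted line, and `keepWhereNextPresent` on the time cells of two consecutive station rows.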
