I can extract a table from a PDF with Camelot without problems, because the columns in my table are clearly separated by spaces. To drop unwanted rows, I use a filter that deletes every row that does not have a number in the first column. But sometimes, and I don't know why, the generated PDF (which always comes from the same web server) introduces a newline character \n between the first and second columns, and only in the first and second rows. The PDF looks fine, but my filter deletes those rows because it does not detect a number in the first column.

Output of Camelot when the PDF doesn't introduce \n (only the first two rows):

      0         1                 2         3     4       5       6           7 

0   Pos.   Art-Nr.       Bezeichnung     Menge         Preis  Rabatt       Summe
1    68   10.30.42            Dimmer         1   Stk     100   10.0%        90.0

Output of Camelot when the PDF introduces \n:

               0                  1         2     3       4       5           6  

0   Pos.\nArt-Nr.       Bezeichnung     Menge         Preis  Rabatt       Summe
1   68\n10.30.42             Dimmer         1   Stk     100   10.0%        90.0

So 68\n10.30.42 is not recognized as a number, and the row is deleted.
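To illustrate why the filter drops the row, here is a minimal check in plain Python. The two cell values are copied from the outputs above; the variable names are only for illustration. pandas' `.str.isdigit()` applies the same test to each element of the column.

```python
# First-column cell values taken from the two Camelot outputs above.
clean_cell = "68"
merged_cell = "68\n10.30.42"

# str.isdigit() is True only if every character in the string is a digit.
clean_is_number = clean_cell.isdigit()
merged_is_number = merged_cell.isdigit()

print(clean_is_number)   # True
print(merged_is_number)  # False, because of the "\n" and the dots
```

The merged cell fails the test because of the newline and the dots of the article number, even though it starts with the position number.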

My code:

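import camelot
import pandas as pd

camelot_df = (camelot.read_pdf(input_pdf,
    flavor="stream",
    suppress_stdout=True,
    pages="all"))

pdf_df = pd.DataFrame()

for pages in camelot_df:
    # keep only the rows whose first column contains just digits
    pages.df = pages.df[pages.df[0].str.isdigit()]
    # "not" instead of "~": ~True and ~False are both truthy integers
    if not pages.df.empty:
        # pd.concat replaces the private DataFrame._append
        pdf_df = pd.concat([pdf_df, pages.df])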

I have tried passing strip_text=' \n' to Camelot

or modifying the PDF text before using Camelot with

# assumed: parser is the module from the tika package
from tika import parser

raw = parser.from_file(input_pdf_file_inverter)
content = raw['content']
content = content.replace("\n", " ")
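One possible direction, sketched here only as an assumption and not tested against the real PDF, is to repair the DataFrame after Camelot has produced it: split the merged first column on "\n" into two columns, then apply the digit filter. The helper name `split_merged_first_column` and the sample data (rebuilt by hand from the output above) are hypothetical.

```python
import pandas as pd


def split_merged_first_column(df: pd.DataFrame) -> pd.DataFrame:
    """Split column 0 on the first newline if any of its cells contain one."""
    first = df[0].astype(str)
    if not first.str.contains("\n", regex=False).any():
        return df  # nothing was merged, keep the table as it is
    # Split each cell at the first newline into two new columns.
    parts = first.str.split("\n", n=1, expand=True).fillna("")
    rest = df.drop(columns=[0])
    # Put the two new columns in front and renumber all columns 0..n-1.
    return pd.concat([parts, rest], axis=1, ignore_index=True)


# Hand-made stand-in for the Camelot output with the merged cells.
merged = pd.DataFrame([
    ["Pos.\nArt-Nr.", "Bezeichnung", "Menge", "", "Preis", "Rabatt", "Summe"],
    ["68\n10.30.42", "Dimmer", "1", "Stk", "100", "10.0%", "90.0"],
])

fixed = split_merged_first_column(merged)
rows = fixed[fixed[0].str.isdigit()]  # same filter as in the loop above
print(rows)
```

With this stand-in data the row for position 68 survives the filter and the table has eight columns again, as in the output without \n. Whether the real tables behave the same way depends on what Camelot actually returns for the affected pages.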
  • `camelot.read_pdf` is deprecated, see: https://stackoverflow.com/questions/74939758/camelot-deprecationerror-pdffilereader-is-deprecated, and the question might be hard to reproduce when this problem only occurs with certain PDFs? – Luuk Aug 05 '23 at 09:42
  • Very strange, because the official documentation of Camelot [link](https://camelot-py.readthedocs.io/en/master/) still uses this method – giancarlo64 Aug 06 '23 at 09:42
  • Hm, I just installed it, and got that message about being deprecated. When reading the complete [issue on Github](https://github.com/camelot-dev/camelot/issues/339) it seems to be solved already.... – Luuk Aug 06 '23 at 09:51
