I have extracted a table from a PDF with Camelot without problems, because the columns in my table are well separated by spaces. To remove some unwanted rows, I use a filter that deletes every row that doesn't have a number in the first column. But sometimes, and I don't know why, the generated PDF (which always comes from the same web server) introduces an escape character \n between the first and second columns, only in the first and second rows. The PDF looks fine, but my filter deletes those rows because it doesn't detect a number in the first column.
Output of Camelot when the PDF doesn't introduce \n (only the first two rows):
0 1 2 3 4 5 6 7
0 Pos. Art-Nr. Bezeichnung Menge Preis Rabatt Summe
1 68 10.30.42 Dimmer 1 Stk 100 10.0% 90.0
Output of Camelot when the PDF introduces \n:
0 1 2 3 4 5 6
0 Pos.\nArt-Nr. Bezeichnung Menge Preis Rabatt Summe
1 68\n10.30.42 Dimmer 1 Stk 100 10.0% 90.0
So 68\n10.30.42 is not recognized as a number and the row is deleted.
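To make the failure concrete, here is a minimal reproduction without Camelot. The two Series are hypothetical stand-ins for the first column of each output above; they are not read from the real PDF.

```python
import pandas as pd

# Hypothetical first-column values, mimicking the two Camelot outputs above.
clean = pd.Series(["Pos.", "68"])
merged = pd.Series(["Pos.\nArt-Nr.", "68\n10.30.42"])

# str.isdigit() is True only when every character in the cell is a digit.
clean_mask = clean.str.isdigit()
merged_mask = merged.str.isdigit()

print(clean_mask.tolist())   # [False, True]  -> the data row is kept
print(merged_mask.tolist())  # [False, False] -> the data row is dropped
```

The merged cell contains "\n" and "." in addition to digits, so the data row fails the same test that the header row fails.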
My code:
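import camelot
import pandas as pd

# Read every page with the whitespace-based "stream" parser
tables = camelot.read_pdf(input_pdf,
                          flavor="stream",
                          suppress_stdout=True,
                          pages="all")
frames = []
for table in tables:
    # Keep only the rows whose first column is purely numeric
    filtered = table.df[table.df[0].str.isdigit()]
    if not filtered.empty:
        frames.append(filtered)
# Combine the filtered tables from all pages
pdf_df = pd.concat(frames) if frames else pd.DataFrame()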
I have tried passing strip_text=' \n' to Camelot,
or modifying the PDF text before using Camelot with
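from tika import parser  # assumption: parser is the tika-python parser module
raw = parser.from_file(input_pdf_file_inverter)
content = raw['content']
# Replace every newline in the extracted text with a space
content = content.replace("\n", " ")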
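Another direction would be to leave the PDF alone and make the filter tolerant of the merged cell instead. The following is only a minimal sketch, assuming the merged cell always starts with the position number followed by \n. The names keep_numbered_rows and sample are hypothetical, and the sketch does not split the merged cell back into two columns.

```python
import pandas as pd

def keep_numbered_rows(df: pd.DataFrame) -> pd.DataFrame:
    # Test only the text before the first newline in column 0,
    # so a merged cell that begins with a number still counts as numbered.
    first_part = df[0].astype(str).str.split("\n").str[0].str.strip()
    return df[first_part.str.isdigit()]

# Hypothetical sample shaped like the Camelot output with the merged cells.
sample = pd.DataFrame({
    0: ["Pos.\nArt-Nr.", "68\n10.30.42"],
    1: ["Bezeichnung", "Dimmer"],
})
result = keep_numbered_rows(sample)
print(result)
```

With this sample the header row is still dropped and the data row is kept, although its first cell still holds both the position and the article number.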