I am working on a project where I have about a thousand Word files or PDFs. Each document contains one specific table that I want to extract. The word "results" should appear in a heading or in the text of the document, and the table I want is the one that follows it. I looked at libraries for both PDF and docx, but I could not figure out how to first find the word and then extract the table after it. I found a PDF library called camelot that extracts tables, but it extracts all of them, and once everything is extracted I have no way to tell which table is the one I want. Is there a way to extract only my chosen table? Can anyone point me to an appropriate library or method for reading just that one table?
-
Yes, this can be done with docx files. You say you want to search for the word 'results' (case-insensitively?) in a heading or in paragraph text, but not inside a table [header], and then extract the next table. Should that table be extracted even if another paragraph sits in between, or must the table be the very next block for it to be extracted? – moken Jan 03 '23 at 13:31
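A minimal sketch of the docx approach described in this comment, assuming the python-docx package and assuming "results" appears in a body paragraph or heading (page headers and footers are not scanned). The names `first_table_after_keyword`, `iter_docx_blocks`, `table_to_rows` and the `must_be_adjacent` flag are hypothetical, introduced only for illustration; the flag covers the open question of whether another paragraph may sit between the keyword and the table.

```python
import re

# Case-insensitive whole-word match; adjust if "results" can be part of a longer word.
KEYWORD = re.compile(r"\bresults\b", re.IGNORECASE)


def first_table_after_keyword(blocks, pattern=KEYWORD, must_be_adjacent=False):
    """Return the first table that follows a paragraph matching `pattern`.

    `blocks` yields ("paragraph", text) or ("table", table_object) in
    document order. With must_be_adjacent=True, any non-empty paragraph
    between the keyword and the table cancels the match.
    """
    armed = False
    for kind, payload in blocks:
        if kind == "paragraph":
            if pattern.search(payload):
                armed = True
            elif must_be_adjacent and payload.strip():
                armed = False
        elif kind == "table" and armed:
            return payload
    return None


def iter_docx_blocks(path):
    """Yield body paragraphs and tables of a .docx file in document order."""
    # Imported here so the search logic above works without python-docx installed.
    from docx import Document
    from docx.oxml.table import CT_Tbl
    from docx.oxml.text.paragraph import CT_P
    from docx.table import Table
    from docx.text.paragraph import Paragraph

    doc = Document(path)
    for child in doc.element.body.iterchildren():
        if isinstance(child, CT_P):
            yield "paragraph", Paragraph(child, doc).text
        elif isinstance(child, CT_Tbl):
            yield "table", Table(child, doc)


def table_to_rows(table):
    """Convert a python-docx Table into a list of rows of cell text."""
    return [[cell.text.strip() for cell in row.cells] for row in table.rows]
```

Usage would look something like `table = first_table_after_keyword(iter_docx_blocks("report.docx"))`, followed by `table_to_rows(table)` when the result is not `None`; the file name is a placeholder. Keeping the search separate from the file reading means the matching rule can be checked on plain tuples before running it over a thousand documents.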
-
I also use camelot for a project. You can specify which pages of the PDF to read, or define a specific area by coordinates, so that tables are extracted only from there. Alternatively, you can extract all tables and then filter for the one you want using the extracted data or the extraction report, for example the table's row/column count or the extraction accuracy. Without some information like that, it is not possible to distinguish one table from another. I have no experience with docx. – Said Akyuz Jan 10 '23 at 13:50
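A rough sketch of the page-restriction and report-filtering ideas from this comment. It assumes camelot for table extraction and, as an addition not mentioned in the thread, pypdf for reading page text to find where "results" occurs. The function names and the `min_accuracy` threshold of 80 are hypothetical. It narrows extraction to pages that contain the keyword, so it will miss a table that starts on the page after the keyword, and it does not check the table's position relative to the word on the page.

```python
def pages_with_keyword(page_texts, keyword="results"):
    """Return 1-based page numbers whose text contains `keyword` (case-insensitive)."""
    needle = keyword.lower()
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if needle in (text or "").lower()
    ]


def to_camelot_pages(page_numbers):
    """Format page numbers as the comma-separated string camelot expects."""
    return ",".join(str(n) for n in page_numbers)


def extract_results_tables(pdf_path, min_accuracy=80.0):
    """Extract tables only from pages mentioning the keyword, as DataFrames."""
    # Third-party imports are kept local so the helpers above run without them.
    import camelot
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    texts = [page.extract_text() for page in reader.pages]
    hits = pages_with_keyword(texts)
    if not hits:
        return []

    tables = camelot.read_pdf(pdf_path, pages=to_camelot_pages(hits))
    # parsing_report is camelot's per-table extraction report; drop weak extractions.
    return [t.df for t in tables if t.parsing_report["accuracy"] >= min_accuracy]
```

If several tables remain on a matching page, the row/column count (`df.shape`) or the `order` entry of the parsing report could serve as a further filter, as the comment suggests.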