Extract table with invisible lines from PDF

Question

Problem Statement:

I have a PDF whose content is laid out like a table, but the table lines are not visible. An example is shown below:

The above image is how my table looks in one of the PDF pages.

My Research

How to extract table as text from the PDF using Python? -- I went through this question and all of its answers. None of them helped.
Tabula: I tried the tabula API, but it only extracts the headers and not the text, probably because there are no lines.
I could convert the whole PDF to text and then try to extract the table with regex or other data manipulation. But that would be tedious and time-consuming. Also, whenever the PDF layout changes, the code would have to be rewritten.

Ask

Is there any API or Python package that can help me do this (Windows and Python 3.x)?

Sorry my bad. Should read; [how-to-extract-table-as-text-from-the-pdf-using-python](https://stackoverflow.com/questions/47533875/how-to-extract-table-as-text-from-the-pdf-using-python/47719296) — stovfl, Sep 28 '18 at 12:51
I've already done that. Have you read my question completely? This link is part of my research. — Rahul Agarwal, Sep 28 '18 at 12:53
And this link does **Answer** your Question! Also, without a [mcve] this Question will be put on **hold**. — stovfl, Sep 28 '18 at 12:59
It doesn't, as my table has invisible boundaries and all the answers in the linked question assume that the tables have visible boundaries. — Rahul Agarwal, Sep 28 '18 at 13:17
Feel free to [edit] your Question with a [mcve] to show where you get stuck. — stovfl, Sep 28 '18 at 13:31
Related: [extracting-text-from-a-pdf-file-using-python](https://stackoverflow.com/questions/34837707/extracting-text-from-a-pdf-file-using-python) — stovfl, Sep 28 '18 at 13:37
Did you figure it out? I'm not sure why the question was downvoted, as it is a legitimate inquiry. — El_1988, Sep 17 '19 at 15:36
@El_1988: There is potentially no generic solution to this! What I have tried is reading the PDF with PyMuPDF or another package, seeing how it has broken up the table, and then writing the logic to extract the relevant data. — Rahul Agarwal, Sep 18 '19 at 07:54
For anyone looking for the answer: https://stackoverflow.com/questions/53209335/python-camelot-borderless-table-extraction-issue might help — Zia Ul Rehman Mughal, Dec 09 '19 at 05:50

score 1 · Answer 1 · answered Sep 28 '18 at 22:02

You need to use a package that gives you the x- and y-coordinates of text in the PDF. PyMuPDF or pdfminer would be my suggestions. You'll then need to programmatically determine what row and column each text block you come across is in.
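
A minimal sketch of this coordinate-based approach, assuming each word is available as an `(x0, y0, x1, y1, text)` tuple with the origin at the top left. The function names, the tolerances `y_tol` and `min_gap`, and the gap-based column heuristic are illustrative assumptions rather than part of any library. PyMuPDF's `page.get_text("words")` is one way to obtain such tuples; the helper `words_from_pdf` below is only a hypothetical convenience wrapper around it.

```python
def words_from_pdf(path, page_number=0):
    """Hypothetical helper: read word boxes from one page with PyMuPDF."""
    import fitz  # PyMuPDF, imported lazily so the rest works without it

    with fitz.open(path) as doc:
        # Each entry is (x0, y0, x1, y1, word, block_no, line_no, word_no).
        return [w[:5] for w in doc[page_number].get_text("words")]


def group_rows(words, y_tol=3.0):
    """Group words whose top edges lie within y_tol of the row's first word."""
    rows = []
    for w in sorted(words, key=lambda w: (w[1], w[0])):
        if rows and abs(w[1] - rows[-1][0][1]) <= y_tol:
            rows[-1].append(w)
        else:
            rows.append([w])
    return rows


def find_columns(words, min_gap=10.0):
    """Merge horizontal word spans; a gap of at least min_gap starts a new column."""
    columns = []
    for x0, x1 in sorted((w[0], w[2]) for w in words):
        if columns and x0 - columns[-1][1] < min_gap:
            columns[-1][1] = max(columns[-1][1], x1)
        else:
            columns.append([x0, x1])
    return columns


def words_to_table(words, y_tol=3.0, min_gap=10.0):
    """Return a list of rows, each a list of cell strings (one per column)."""
    columns = find_columns(words, min_gap)
    table = []
    for row in group_rows(words, y_tol):
        cells = [[] for _ in columns]
        for x0, _y0, x1, _y1, text in sorted(row):
            center = (x0 + x1) / 2
            for i, (c0, c1) in enumerate(columns):
                if c0 <= center <= c1:
                    cells[i].append(text)
                    break
        table.append([" ".join(cell) for cell in cells])
    return table
```

Calling `words_to_table(words_from_pdf("file.pdf"))` would be the typical use. Both tolerances are in PDF points and will likely need tuning per document: too small a `min_gap` splits a multi-word cell into several columns, while too large a value fuses neighbouring columns.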

answered Sep 28 '18 at 22:02

J. Owens


score 1 · Answer 2 · answered Nov 30 '20 at 11:33

Try using Camelot and specify that your table has no lines like this:

import camelot
tables = camelot.read_pdf('file.pdf', flavor='stream')

For more information, refer to the documentation: https://camelot-py.readthedocs.io/en/master/
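
As a rough sketch of how the result might be consumed: `camelot.read_pdf` returns a list-like collection of tables, and each table exposes its contents as a pandas DataFrame through `.df`. The wrapper name `read_borderless_tables` and its default `pages="1"` are assumptions made for illustration.

```python
def read_borderless_tables(pdf_path, pages="1"):
    """Return one pandas DataFrame per table found with Camelot's stream flavor."""
    import camelot  # imported lazily; requires the camelot-py package

    # flavor="stream" infers columns from whitespace instead of ruling lines.
    tables = camelot.read_pdf(pdf_path, pages=pages, flavor="stream")
    return [table.df for table in tables]
```

For example, `read_borderless_tables("file.pdf", pages="1-3")` would return the tables detected on the first three pages. How well the columns are detected depends on the layout, so the DataFrames usually need a visual check.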

answered Nov 30 '20 at 11:33

odilo


zhangjq · Answer 3 · 2021-09-04T22:07:19.973

I solved this problem with tabula-py:

conda install tabula-py

and

>>> import tabula
>>> area = [70, 30, 750, 570]  # [top, left, bottom, right] in points; seems to have to be found manually
>>> # The `tabula` docs explain the parameters very well
>>> page2 = tabula.read_pdf("nar_2021_editorial-2.pdf", guess=False, lattice=False,
...                         stream=True, multiple_tables=False, area=area, pages="all")
>>> page2

and I got this result:

> 'pages' argument isn't specified.Will extract only from page 1 by default.
> [        ShortTitle                                              Text  \
> 0       Arena3Dweb         3D visualisation of multilayered networks
> 1          Aviator       Monitoring the availability of web services
> 2         b2bTools  Predictions for protein biophysical features and
> 3              NaN                                their conservation
> 4          BENZ WS          Four-level Enzyme Commission (EC) number
> ..             ...                                               ...
> 68  miRTargetLink2              miRNA target gene and target pathway
> 69             NaN                                          networks
> 70       mmCSM-PPI            Effects of multiple point mutations on
> 71             NaN                      protein-protein interactions
> 72        ModFOLD8           Quality estimates for 3D protein models
>
>                                                URL
> 0                    http://bib.fleming.gr/Arena3D
> 1          https://www.ccb.uni-saarland.de/aviator
> 2                    https://bio2byte.be/b2btools/
> 3                                              NaN
> 4                 https://benzdb.biocomp.unibo.it/
> ..                                             ...
> 68  https://www.ccb.uni-saarland.de/mirtargetlink2
> 69                                             NaN
> 70          http://biosig.unimelb.edu.au/mmcsm ppi
> 71                                             NaN
> 72       https://www.reading.ac.uk/bioinf/ModFOLD/
>
> [73 rows x 3 columns]]

The result is a list of pandas DataFrames (here containing a single table), so you can loop over it with for df in page2: and then work with the rows of each DataFrame.
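
In the output above, cells that wrap onto a second line show up as extra rows with NaN in the ShortTitle column. One possible clean-up, sketched here under the assumption that a missing value in a key column always marks a continuation of the previous row, is to fold such rows back into their predecessor. The helper names are hypothetical, and the function works on plain dictionaries such as those produced by `df.to_dict("records")`.

```python
import math


def is_missing(value):
    """True for None or float NaN (pandas uses NaN for empty cells)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def merge_wrapped_rows(records, key):
    """Fold rows whose `key` cell is missing into the preceding row."""
    merged = []
    for rec in records:
        if merged and is_missing(rec.get(key)):
            prev = merged[-1]
            for col, val in rec.items():
                if col == key or is_missing(val):
                    continue
                # Append the continuation text to the previous row's cell.
                prev[col] = val if is_missing(prev.get(col)) else f"{prev[col]} {val}"
        else:
            merged.append(dict(rec))
    return merged
```

With the table above this would be called as `merge_wrapped_rows(page2[0].to_dict("records"), key="ShortTitle")`. The heuristic fails if a genuine row has an empty key cell, so it is worth spot-checking the merged rows against the PDF.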

Hope it helps.

This response helped me so immensely. I wish I could treat you to a fine dinner. — Dance Party, Aug 12 '21 at 02:31
@DanceParty Happy to hear about that. So please vote me up because I'm new to this community and need some reputation. Thanks~ — zhangjq, Mar 18 '22 at 06:11
How would you achieve this if the area (the bounding box, I guess) is not known and can't be determined manually? — Lidor Eliyahu Shelef, Oct 26 '22 at 06:51
