Problem Statement:

I have a PDF whose content is laid out like a table, but the table has no visible lines. An example is shown below:

Sample Table

The above image is how my table looks in one of the PDF pages.

My Research

  1. How to extract table as text from the PDF using Python?: I went through this question and read all the answers, but none of them helped.

  2. Tabula: I tried the tabula API, but it extracts only the headers and not the text, probably because the table has no lines.

  3. I could convert the whole PDF to text and then try to extract the table with regex or other data manipulation. But that would be tedious and time-consuming, and whenever the PDF layout changes the code would have to be rewritten.

Ask

Is there any API or Python package that can help me do this (Windows and Python 3.x)?

Rahul Agarwal
  • Sorry my bad. Should read; [how-to-extract-table-as-text-from-the-pdf-using-python](https://stackoverflow.com/questions/47533875/how-to-extract-table-as-text-from-the-pdf-using-python/47719296) – stovfl Sep 28 '18 at 12:51
  • Already done that. Have you read my question completely? This link is part of my research. – Rahul Agarwal Sep 28 '18 at 12:53
  • And this link does **Answer** your Question! Also, without a [mcve] this Question will be put on **hold**. – stovfl Sep 28 '18 at 12:59
  • It doesn't, as my table has invisible boundaries and all the answers in the link provided assume the table has visible boundaries. – Rahul Agarwal Sep 28 '18 at 13:17
  • Feel free to [edit] your Question with a [mcve] to show where you get stuck. – stovfl Sep 28 '18 at 13:31
  • Related: [extracting-text-from-a-pdf-file-using-python](https://stackoverflow.com/questions/34837707/extracting-text-from-a-pdf-file-using-python) – stovfl Sep 28 '18 at 13:37
  • Did you figure it out? I'm not sure why the question was downvoted, as it is a legitimate inquiry. – El_1988 Sep 17 '19 at 15:36
  • @El_1988: There is potentially no generic solution to this! What I have tried is to read the PDF through PyMuPDF or another package, see how it has broken up the table, and then write the code/logic to extract the relevant data. – Rahul Agarwal Sep 18 '19 at 07:54
  • For anyone looking for the answer: https://stackoverflow.com/questions/53209335/python-camelot-borderless-table-extraction-issue might help – Zia Ul Rehman Mughal Dec 09 '19 at 05:50

3 Answers

You need to use a package that gives you the x- and y-coordinates of text in the PDF. PyMuPDF or pdfminer would be my suggestions. You'll then need to programmatically determine what row and column each text block you come across is in.
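
As a rough illustration of that idea, here is a minimal sketch. It assumes the words arrive as `(x0, y0, x1, y1, text)` tuples, which is the shape of the first five fields returned by PyMuPDF's `page.get_text("words")`. The helper names (`group_rows`, `find_columns`, `to_table`, `words_from_pdf`) and the tolerances `y_tol` and `min_gap` are made up for this example and will likely need tuning per document.

```python
def group_rows(words, y_tol=3.0):
    """Cluster words into rows: a word joins the current row when its top
    edge is within y_tol of the top edge of the row's first word."""
    rows = []
    for w in sorted(words, key=lambda w: (w[1], w[0])):
        if rows and abs(w[1] - rows[-1][0][1]) <= y_tol:
            rows[-1].append(w)
        else:
            rows.append([w])
    # Order each row left to right.
    return [sorted(r, key=lambda w: w[0]) for r in rows]


def find_columns(words, min_gap=15.0):
    """Merge the horizontal extents of all words; a horizontal gap wider
    than min_gap is treated as a column boundary."""
    columns = []
    for x0, x1 in sorted((w[0], w[2]) for w in words):
        if columns and x0 - columns[-1][1] <= min_gap:
            columns[-1][1] = max(columns[-1][1], x1)
        else:
            columns.append([x0, x1])
    return [(a, b) for a, b in columns]


def to_table(words, y_tol=3.0, min_gap=15.0):
    """Return a list of rows, each a list of cell strings (one per column)."""
    columns = find_columns(words, min_gap)
    table = []
    for row in group_rows(words, y_tol):
        cells = [[] for _ in columns]
        for x0, _y0, _x1, _y1, text in row:
            # Every word's extent was merged into exactly one column span.
            idx = next(i for i, (a, b) in enumerate(columns) if a <= x0 <= b)
            cells[idx].append(text)
        table.append([" ".join(c) for c in cells])
    return table


def words_from_pdf(path, page_number=0):
    """Read word boxes with PyMuPDF (imported lazily; assumes a recent
    version that provides Page.get_text)."""
    import fitz  # PyMuPDF

    with fitz.open(path) as doc:
        return [w[:5] for w in doc[page_number].get_text("words")]
```

Usage would be along the lines of `to_table(words_from_pdf("file.pdf"))`. This simple gap-based approach breaks down when a cell spans several columns or when columns sit very close together, so treat it as a starting point rather than a general solution.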

J. Owens

Try Camelot with the stream flavor, which is intended for tables whose cells are separated by whitespace rather than lines:

import camelot
tables = camelot.read_pdf('file.pdf', flavor='stream')

For more information, refer to the documentation: https://camelot-py.readthedocs.io/en/master/

odilo

I solved this problem with tabula-py:

conda install tabula-py

and then

>>> import tabula
>>> # area is [top, left, bottom, right] in PDF points; it seems to have to be found manually
>>> area = [70, 30, 750, 570]
>>> # The `tabula` docs explain the parameters very well
>>> page2 = tabula.read_pdf("nar_2021_editorial-2.pdf", guess=False, lattice=False,
...                         stream=True, multiple_tables=False, area=area, pages="all")
>>> page2

and I got this result:

> 'pages' argument isn't specified. Will extract only from page 1 by default.
> [        ShortTitle                                              Text  \
> 0       Arena3Dweb         3D visualisation of multilayered networks
> 1          Aviator       Monitoring the availability of web services
> 2         b2bTools  Predictions for protein biophysical features and
> 3              NaN                                their conservation
> 4          BENZ WS          Four-level Enzyme Commission (EC) number
> ..             ...                                               ...
> 68  miRTargetLink2              miRNA target gene and target pathway
> 69             NaN                                          networks
> 70       mmCSM-PPI            Effects of multiple point mutations on
> 71             NaN                      protein-protein interactions
> 72        ModFOLD8           Quality estimates for 3D protein models
>
>                                                URL
> 0                    http://bib.fleming.gr/Arena3D
> 1          https://www.ccb.uni-saarland.de/aviator
> 2                    https://bio2byte.be/b2btools/
> 3                                              NaN
> 4                 https://benzdb.biocomp.unibo.it/
> ..                                             ...
> 68  https://www.ccb.uni-saarland.de/mirtargetlink2
> 69                                             NaN
> 70          http://biosig.unimelb.edu.au/mmcsm ppi
> 71                                             NaN
> 72       https://www.reading.ac.uk/bioinf/ModFOLD/
>
> [73 rows x 3 columns]]

The result is a list of pandas DataFrames, so you can loop over it with `for df in page2:` and work with each DataFrame as usual.
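
The `NaN` entries under ShortTitle in the output above look like wrapped lines that belong to the row before them. A possible clean-up step, sketched here in plain Python, folds such continuation rows back into their parent row. It assumes the rows have been converted to dictionaries, for example with `page2[0].to_dict("records")`; the function names `is_missing` and `merge_wrapped_rows` are invented for this example.

```python
import math


def is_missing(value):
    """True for None or a float NaN, which is how pandas marks an empty cell."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def merge_wrapped_rows(records, key="ShortTitle"):
    """Fold every row whose `key` cell is empty into the preceding row,
    joining the non-empty cells with a space."""
    merged = []
    for rec in records:
        if merged and is_missing(rec.get(key)):
            prev = merged[-1]
            for col, val in rec.items():
                if col == key or is_missing(val):
                    continue
                prev[col] = val if is_missing(prev.get(col)) else f"{prev[col]} {val}"
        else:
            # Copy so the caller's records are not modified.
            merged.append(dict(rec))
    return merged
```

This relies on the first column never being legitimately empty, which seems to hold for this table but is an assumption about the data.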

Hope it helps.

zhangjq