Can't fetch only the names from a table located in a pdf file from a webpage

Question

I've written a Python script that uses the requests module and the PyPDF2 library to parse a PDF file from a website. I'm only interested in the names in column A, under the header Facility Name, in the table on page 4 of that PDF. My script can extract the text from that page, but I can't find a way to get only the names and nothing else.

pdf file link that I've used within the script

This is what the table looks like

I'm only interested in the names under the column header Facility Name.

I've tried with:

import io
import PyPDF2
import requests

URL = 'https://www.cms.gov/Medicare/Provider-Enrollment-and-Certification/CertificationandComplianc/Downloads/SFFList.pdf'

# Download the PDF and wrap the raw bytes in a file-like object
res = requests.get(URL)
f = io.BytesIO(res.content)
reader = PyPDF2.PdfFileReader(f)
# Pages are zero-indexed, so index 3 is page 4
contents = reader.getPage(3).extractText()
print(contents)

The output I'm getting right now looks like this:

Facilit
y Name
Address
City
State
Zip
Phone 
Number
Months as an 
SFFWillows Center
320 North Crawford Street
Willows
CA95988530-934-2834
5Winter Park Care & Rehabilitation Center
2970 Scarlett Rd
Winter Park
FL32792407-671-8030
and so on -----

The output I'd like to have:

Willows Center
Winter Park Care & Rehabilitation Center
Pinehill Nursing Center
River Brook Healthcare Center

How can I get only the names available in a table from a pdf file?

abdusco · Accepted Answer · 2019-07-19T08:15:48.600

Unfortunately, a PDF is not a structured document. It is just strings and images placed at coordinates so that the page looks exactly as it was created, regardless of which program renders it. This means you cannot parse it as easily as HTML: a table is not wrapped in a <table> element, and its text is scattered across the page.

See:

If identifying text structure in PDF documents is so difficult, how do PDF readers do it so well?
How to extract data from a PDF file while keeping track of its structure?

Take a look at https://github.com/atlanhq/camelot; it might help you.
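A rough sketch of how that could look, not tested against this particular file. It assumes the PDF has already been saved locally as SFFList.pdf (a made-up file name), that the table has ruled cell borders (hence flavor="lattice"; try "stream" otherwise), and that the header cell reads "Facility Name" once whitespace is ignored. The helper facility_names is mine for illustration, not part of Camelot.

```python
def facility_names(rows, header="Facility Name"):
    """Return the values of the `header` column from a table given as a list
    of rows (header row first). Whitespace is ignored when matching the
    header, since extracted cells are often broken up like 'Facilit\\ny Name'."""
    def norm(cell):
        return "".join(str(cell).split()).lower()

    head = [norm(cell) for cell in rows[0]]
    if norm(header) not in head:
        return []
    col = head.index(norm(header))
    # Collapse internal line breaks in each name and skip empty cells
    return [" ".join(str(row[col]).split()) for row in rows[1:] if str(row[col]).strip()]


def names_from_pdf(path="SFFList.pdf", pages="4"):
    """Assumed workflow: read the tables on the given page(s) with Camelot."""
    import camelot  # third-party: pip install camelot-py

    tables = camelot.read_pdf(path, pages=pages, flavor="lattice")
    names = []
    for table in tables:
        # table.df is a pandas DataFrame whose first row holds the header cells
        names.extend(facility_names(table.df.values.tolist()))
    return names
```

Calling names_from_pdf() should give a plain list of names if Camelot detects the table; if it returns an empty list, inspect tables[0].df by hand to see how the header was actually extracted.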

(There are at most 10 pages there with a table, so going manual might be a faster option here, unless you have many PDFs like this.)
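If you want to stay with PyPDF2, a text heuristic might be enough for this one file. The sketch below is based only on the output shown in the question: every record seems to end with a line where state, ZIP and phone number are fused together (e.g. CA95988530-934-2834), and the name sits three lines above it, prefixed by either the header tail "SFF" or the previous row's month count. Those are assumptions; a name that itself starts with a digit, or an address that wraps onto two lines, would break it.

```python
import re

# State code, 5-digit ZIP and phone number fused on one line, e.g. "CA95988530-934-2834"
LOCATION_LINE = re.compile(r"^[A-Z]{2}\d{5}\d{3}-\d{3}-\d{4}$")
# Residue fused onto the front of a name: header tail "SFF" or the previous row's month count
LEADING_RESIDUE = re.compile(r"^(?:SFF|\d+)")


def extract_facility_names(page_text):
    """Pick the line three above each state/ZIP/phone line and strip the residue."""
    lines = [line.strip() for line in page_text.splitlines() if line.strip()]
    names = []
    for i, line in enumerate(lines):
        if i >= 3 and LOCATION_LINE.match(line):
            names.append(LEADING_RESIDUE.sub("", lines[i - 3]).strip())
    return names


# The extracted text quoted in the question
SAMPLE = """Facilit
y Name
Address
City
State
Zip
Phone 
Number
Months as an 
SFFWillows Center
320 North Crawford Street
Willows
CA95988530-934-2834
5Winter Park Care & Rehabilitation Center
2970 Scarlett Rd
Winter Park
FL32792407-671-8030
"""

if __name__ == "__main__":
    for name in extract_facility_names(SAMPLE):
        print(name)
```

In the question's script you would pass contents to extract_facility_names instead of printing it.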

edited Jul 19 '19 at 08:15

answered Jul 19 '19 at 08:09


So, there is no such library out there to do the trick easily, right? – robots.txt Jul 19 '19 at 19:43
Camelot seems promising, but no, it's not easy – abdusco Jul 19 '19 at 19:46
