I've created a script in Python using the requests module and the PyPDF2 library to parse the PDF content from a website. I'm only interested in the names in column A under Facility Name, on page 4 (tabular content) of that PDF file. My script can scrape the content from that page, but I can't find any way to get only the names and nothing else.
The link to the PDF file is the URL used within the script.
The table on that page has the column headers Facility Name, Address, City, State, Zip, Phone Number and Months as an SFF. I'm only interested in the names under the column header Facility Name.
I've tried with:
import io
import PyPDF2
import requests
URL = 'https://www.cms.gov/Medicare/Provider-Enrollment-and-Certification/CertificationandComplianc/Downloads/SFFList.pdf'
res = requests.get(URL)
f = io.BytesIO(res.content)
reader = PyPDF2.PdfFileReader(f)
contents = reader.getPage(3).extractText()
print(contents)
The output I'm getting right now looks like this:
Facilit
y Name
Address
City
State
Zip
Phone
Number
Months as an
SFFWillows Center
320 North Crawford Street
Willows
CA95988530-934-2834
5Winter Park Care & Rehabilitation Center
2970 Scarlett Rd
Winter Park
FL32792407-671-8030
and so on -----
The output I'd like to have:
Willows Center
Winter Park Care & Rehabilitation Center
Pinehill Nursing Center
River Brook Healthcare Center
How can I get only the names from a table in a PDF file?
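To make the goal concrete, here is a minimal sketch of the kind of post-processing I have in mind, working only on the text that extractText() already returns. It rests on assumptions drawn from the sample output above and nothing else: every record ends with a line where state, ZIP and phone are fused together (such as CA95988530-934-2834), the next line holds the months count fused to the following facility name, and the first name is fused to the trailing "SFF" of the header. The helper name facility_names and the RECORD_END pattern are hypothetical, not part of PyPDF2, and the heuristic would break for a facility name that itself starts with a digit.

```python
import re

# Assumed shape of the last line of each record in the extracted text:
# two-letter state + 5-digit ZIP + phone number, all fused together.
RECORD_END = re.compile(r"^[A-Z]{2}\d{5}\d{3}-\d{3}-\d{4}$")


def facility_names(text):
    """Heuristically pull facility names out of the extracted page text."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    names = []
    for prev, cur in zip(lines, lines[1:]):
        if prev == "Months as an" and cur.startswith("SFF"):
            # First record: the name is fused to the end of the header.
            name = cur[len("SFF"):]
        elif RECORD_END.match(prev):
            # Later records: strip the leading "months as an SFF" digits.
            name = re.sub(r"^\d+", "", cur)
        else:
            continue
        if name:
            names.append(name)
    return names


# Sample copied from the output shown in the question.
sample = """Facilit
y Name
Address
City
State
Zip
Phone
Number
Months as an
SFFWillows Center
320 North Crawford Street
Willows
CA95988530-934-2834
5Winter Park Care & Rehabilitation Center
2970 Scarlett Rd
Winter Park
FL32792407-671-8030
"""

for name in facility_names(sample):
    print(name)
```

On the sample above this prints the first two names from my desired output. In the real script the call would be facility_names(contents), but a text heuristic like this feels fragile, so I'd prefer an approach that understands the table structure if one exists.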