I have a data frame like the one below. It has 500+ rows; I am only showing a sample. The column URL contains links to PDFs on the web. I would like to open each PDF and copy its content into a new column, PDF data. I understand that some of the PDFs could be very long, so the amount of text in that column could be huge in some cases.
For example, in the case of the first row, I would like to copy the content of 'https://www.occ.gov/static/enforcement-actions/ea2018-001.pdf' into the column PDF data.
In the case of the second row, PDF data would be empty.
In the case of the third row, PDF data would hold the content of 'https://www.occ.gov/static/enforcement-actions/ea2017-104.pdf'.
I came across this URL that works with PDFs, but it requires all the PDFs to be downloaded into a single folder, and its output is a folder of txt files. I would instead like the contents of the PDFs in a column of the data frame. Moreover, I have 500+ rows, so I won't be able to download the PDFs manually one at a time.
import pandas as pd
import numpy as np
sales = [
    {'account': 'credit cards', 'Jan': '150 jones', 'Feb': '200 .jones', 'URL': 'https://www.occ.gov/static/enforcement-actions/ea2018-001.pdf'},
    {'account': '1', 'Jan': 'Jones', 'Feb': '210', 'URL': ''},
    {'account': '1', 'Jan': '50', 'Feb': '90', 'URL': 'https://www.occ.gov/static/enforcement-actions/ea2017-104.pdf'},
]
df = pd.DataFrame(sales)
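
Here is a rough sketch of the kind of thing I am after, not a tested solution. It assumes the third-party pypdf package is installed. The helper names download_bytes, extract_text, and pdf_text_from_url are made up for illustration. A blank URL, or any failed download or parse, simply yields an empty string.

```python
import io
import urllib.request


def download_bytes(url, timeout=30):
    """Fetch the raw bytes behind a URL using only the standard library."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def extract_text(pdf_bytes):
    """Join the text of every page. Assumption: the pypdf package is available."""
    from pypdf import PdfReader  # third-party; imported lazily so the rest works without it

    reader = PdfReader(io.BytesIO(pdf_bytes))
    # extract_text() can return an empty value for image-only pages, hence the "or ''".
    return "\n".join((page.extract_text() or "") for page in reader.pages)


def pdf_text_from_url(url, download=download_bytes, extract=extract_text):
    """Return the PDF text for one URL, or '' for a blank URL or any failure."""
    # Empty strings, None, and NaN (a float) all count as "no URL".
    if not isinstance(url, str) or not url.strip():
        return ""
    try:
        return extract(download(url.strip()))
    except Exception:
        # Assumption: a row whose PDF cannot be fetched or parsed is left empty
        # rather than aborting the whole run.
        return ""
```

Applied to the frame above, this would be something like df['PDF data'] = df['URL'].apply(pdf_text_from_url), which fetches the PDFs one request at a time without saving anything to disk. Some servers reject requests that lack a browser-like User-Agent header, so the download step may need adjusting.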