I have a data frame like the one below; it has 500+ rows and I am only showing a sample. The URL column has links to PDFs on the web. I would like to open each PDF and copy its content into a new column, PDF data. I understand that some of the PDFs could be very long, so the amount of text in that column could be huge in some cases.

For example, for the first row, I would like to copy the content of the URL 'https://www.occ.gov/static/enforcement-actions/ea2018-001.pdf' into the PDF data column.

For the second row, PDF data would be empty, since there is no URL.

For the third row, PDF data would have the content of the PDF 'https://www.occ.gov/static/enforcement-actions/ea2017-104.pdf'.

I came across this URL that works with PDFs, but it requires all the PDFs to be downloaded into a single folder, and its output is a folder of txt files. I would like to have the contents of the PDFs in a column of the data frame instead. Moreover, I have 500+ rows, so I won't be able to download the PDFs by hand one at a time.

import pandas as pd
import numpy as np

sales = [{'account': 'credit cards', 'Jan': '150 jones', 'Feb': '200 .jones', 'URL': 'https://www.occ.gov/static/enforcement-actions/ea2018-001.pdf'},
         {'account': '1',  'Jan': 'Jones', 'Feb': '210', 'URL': ''},
         {'account': '1',  'Jan': '50',  'Feb': '90',  'URL': 'https://www.occ.gov/static/enforcement-actions/ea2017-104.pdf' }]
df = pd.DataFrame(sales)
Ni_Tempe
    What have you tried so far? Take a look at [requests](http://docs.python-requests.org) for accessing network resources. – ti7 Feb 12 '18 at 19:16

1 Answer

I don't know of any good way to extract text from a PDF without downloading it first, and I found this answer that says something similar. However, if you use requests to download the file, you can then use any number of tools to extract the text. For example, PyMuPDF makes it pretty easy to extract the text of a PDF as one long string (docs here).

In order to actually add the extracted text to a new column in your dataframe, you could do something like this:

def pdf_text_extractor(url):
    # code to download pdf
    # code to extract text from pdf
    return pdf_text

# assign returns a new DataFrame, so keep the result
df = df.assign(pdf_text=df['URL'].apply(pdf_text_extractor))
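To make the two placeholder comments concrete, here is a minimal sketch of what the function body might look like, assuming requests and a recent PyMuPDF are installed. The helper names download_pdf and extract_text, the 30-second timeout, and the choice to return an empty string for blank URLs or failed downloads are my own assumptions, not anything the libraries require. I believe older PyMuPDF releases spell the page method getText rather than get_text, so adjust to your installed version.

```python
def download_pdf(url, timeout=30):
    """Fetch the raw bytes of a PDF over HTTP (hypothetical helper)."""
    import requests  # third-party: pip install requests
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return response.content


def extract_text(pdf_bytes):
    """Return the text of every page of an in-memory PDF as one string."""
    import fitz  # third-party: PyMuPDF (pip install pymupdf)
    with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
        return "\n".join(page.get_text() for page in doc)


def pdf_text_extractor(url, download=download_pdf, extract=extract_text):
    """Return the text behind `url`, or '' for blank URLs and failures."""
    # Rows with no URL (empty string, NaN, None) get an empty string.
    if not isinstance(url, str) or not url.strip():
        return ""
    try:
        return extract(download(url.strip()))
    except Exception as exc:
        # Broad on purpose: one bad link should not stop the other rows.
        print(f"Skipping {url}: {exc}")
        return ""
```

With this in place, the df.assign call above should work unchanged. The PDF is held in memory rather than saved to disk, and the download and extract parameters are only there so that you can swap in a different PDF library or test the function without network access.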
ZaxR