Web scraping in Python: converting a PDF file to a .txt file

Question

I tried several ways to get the Fed press conference transcripts (in PDF format) and convert them to a .txt file, but they all failed. Below is my original code. Any suggestions will be highly appreciated.

import csv
from bs4 import BeautifulSoup
import requests

source=requests.get('https://www.federalreserve.gov/monetarypolicy/fomccalendars.htm').text
soup=BeautifulSoup(source,'lxml')

for b in soup.find_all("a",href=True):
    if b.text=='Press Conference':
        lnk='https://www.federalreserve.gov'+b['href']
        source2=requests.get(lnk).text
        soup2=BeautifulSoup(source2,'lxml')
        for c in soup2.find_all("a",href=True):
            if 'Press Conference Transcript'in c.text:
                lnk2='https://www.federalreserve.gov'+c['href']
                source3=requests.get(lnk2).text
                soup3=BeautifulSoup(source3,'lxml')
                for d in soup3.find_all("div",attrs={"id","content"}):
                    print(d)
                    fileout = open('conf.txt', 'a')
                    fileout.write(d)

_I tried several ways to get Fed press conference transcripts (a PDF format) and convert it to a .txt file, but fail._ In what way does it fail? What happens? Which part of the code is responsible for the conversion? – AMC, Jun 26 '20 at 19:58
After I get the PDF link from lnk2 (for example https://www.federalreserve.gov/mediacenter/files/FOMCpresconf20151216.pdf), the code fails to get the text content of the PDF file, starting from the line "source3=requests.get(lnk2).text". – Grace, Jun 29 '20 at 14:56

score 0 · Answer 1 · answered Jun 26 '20 at 20:34


For the PDF scraping part, I came up with the following:

import requests
import io
import PyPDF2

# Download the PDF as raw bytes (.content, not .text, because a PDF is binary)
URL = 'https://www.federalreserve.gov/monetarypolicy/files/monetary20200129a1.pdf'
pdf_bytes = requests.get(URL).content
# The PDF reader expects a file-like object
pdf_stream = io.BytesIO(pdf_bytes)
# PyPDF2 1.x API; PyPDF2 3.x renamed these to PdfReader, reader.pages[0] and extract_text()
reader = PyPDF2.PdfFileReader(pdf_stream)
# Read the first page
page = reader.getPage(0)
page_content = page.extractText()
print(page_content.encode('utf-8'))

It might also be worth looking at the question "How to extract text from a PDF file?".
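To get from "first page printed" to a full .txt file, the same idea can be extended to every page. What follows is only a minimal sketch, assuming PyPDF2 3.x, where the 1.x names used above (`PdfFileReader`, `getPage`, `extractText`) became `PdfReader`, `reader.pages` and `extract_text()`. The helper names `looks_like_pdf`, `pdf_bytes_to_text` and `save_transcript` are made up for illustration. The magic-byte check is there because a server can return an HTML error page instead of a PDF, which is worth ruling out when one URL works and another does not.

```python
import io


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload starts with the PDF header marker '%PDF-'."""
    return data.startswith(b"%PDF-")


def pdf_bytes_to_text(pdf_bytes: bytes) -> str:
    """Concatenate the extracted text of every page of an in-memory PDF."""
    # Imported here so looks_like_pdf can be used without PyPDF2 installed.
    # Assumption: PyPDF2 >= 3.0 (PdfReader / pages / extract_text).
    from PyPDF2 import PdfReader

    reader = PdfReader(io.BytesIO(pdf_bytes))
    # extract_text() may yield an empty result for pages without a text layer.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def save_transcript(url: str, out_path: str) -> None:
    """Download one PDF and write its text to out_path as UTF-8."""
    import requests  # third-party; imported lazily for the same reason as above

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    # .content gives the raw bytes; .text would try to decode the binary PDF.
    if not looks_like_pdf(response.content):
        raise ValueError(f"{url} did not return a PDF")
    with open(out_path, "w", encoding="utf-8") as fh:
        fh.write(pdf_bytes_to_text(response.content))
```

A call such as `save_transcript(URL, 'conf.txt')` would then write the whole document. How clean the text is depends on how the PDF was produced, so the output is worth checking by eye.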

answered Jun 26 '20 at 20:34

AlexNe


Thank you for your response. It works with your URL, but it does not work with URL='https://www.federalreserve.gov/mediacenter/files/FOMCpresconf20151216.pdf', which is one of the URLs I want to convert to .txt. I really do not know why. – Grace Jun 27 '20 at 04:19
Then you should have specified that url in your question. – AlexNe Jun 27 '20 at 05:20

yadayada · Answer 2 · 2020-06-26T21:00:40.993


One recommendation, if you are stuck, is to check out the PyPDF2 library. It is very easy to use if your PDF is well-formed. A simple example looks like this:

    from PyPDF2 import PdfFileReader

    def extract_information(pdf_path):
        # Open in binary mode; PdfFileReader needs a file-like object
        with open(pdf_path, 'rb') as f:
            pdf = PdfFileReader(f)
            information = pdf.getDocumentInfo()
            number_of_pages = pdf.getNumPages()
        return information, number_of_pages

PDFMiner is a good one too.
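Note that `open()` only accepts local paths, so a PDF that lives at a URL has to be downloaded first. Here is a rough sketch with PDFMiner, assuming the pdfminer.six package (whose `pdfminer.high_level.extract_text` accepts either a path or a file-like object) and `requests`. The names `txt_name_for` and `pdf_url_to_txt` are hypothetical helpers, not part of either library.

```python
import io
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def txt_name_for(pdf_url: str) -> str:
    """Derive an output file name such as 'report.txt' from a PDF URL."""
    return PurePosixPath(urlsplit(pdf_url).path).with_suffix(".txt").name


def pdf_url_to_txt(pdf_url: str) -> str:
    """Download one PDF, write its text to a .txt file, return the file name."""
    # Third-party imports are local so txt_name_for works without them.
    import requests
    from pdfminer.high_level import extract_text  # provided by pdfminer.six

    response = requests.get(pdf_url, timeout=30)
    response.raise_for_status()
    # Wrap the raw bytes in a file-like object instead of saving a temp file.
    text = extract_text(io.BytesIO(response.content))
    out_name = txt_name_for(pdf_url)
    with open(out_name, "w", encoding="utf-8") as fh:
        fh.write(text)
    return out_name
```

Calling `pdf_url_to_txt(lnk2)` for each transcript link found by the crawler would give one .txt file per PDF, named after the PDF itself.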

This article from the Real Python blog is a little old, but it is also a good source of information.

edited Jun 26 '20 at 21:00

answered Jun 26 '20 at 20:49

yadayada


Thanks for your response. It is simple, but it still does not work for getting the PDF file from 'https://www.federalreserve.gov/mediacenter/files/FOMCpresconf20151216.pdf'; this is the pdf_path I want to convert to .txt. – Grace Jun 27 '20 at 04:11
