0

I want to download PDF files from a website and work with their text, but I don't want to save each PDF to disk and then convert it to text. I use the Python requests library. Is there any way to get the text directly after the following code?

res = requests.get(url, timeout=None)

damoun rabie
  • 1
    Possible duplicate of [Extracting text from a PDF file using Python](https://stackoverflow.com/questions/34837707/extracting-text-from-a-pdf-file-using-python) – phd Nov 12 '17 at 22:08
  • 1
    I'd say it isn't a duplicate of ^, because OP is asking "Can I do this...?" And the answer is no. – cs95 Nov 12 '17 at 23:24

3 Answers

4

AFAIK, you will have to create at least a temporary file to process the PDF.

You can use the following code, which reads a PDF file and converts it to text. It uses PDFMiner and Python 3.7.

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
import io

def convert(case, fname, pages=None):
    # `case` is unused; it is kept so existing callers keep working.
    # An empty set of page numbers means "all pages".
    pagenums = set(pages) if pages else set()
    manager = PDFResourceManager()
    output = io.StringIO()
    converter = TextConverter(manager, output, codec='utf-8', laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)
    # The with-block closes the PDF even if a page fails to parse.
    with open(fname, 'rb') as infile:
        for page in PDFPage.get_pages(infile, pagenums, caching=True, check_extractable=True):
            interpreter.process_page(page)

    convertedPDF = output.getvalue()
    print(convertedPDF)

    converter.close()
    output.close()
    return convertedPDF

A driver script that calls the function above (assuming it is saved as converter.py):

import os
import converter  # the convert() function above, saved as converter.py


class ConvertMultiple:
    @staticmethod
    def convert_multiple(pdf_dir, txt_dir):
        if pdf_dir == "":
            pdf_dir = os.getcwd()  # default to the current directory
        for pdf in os.listdir(pdf_dir):  # iterate through pdfs in pdf directory
            print("File name is %s" % os.path.basename(pdf))
            file_extension = pdf.split(".")[-1]
            print("file extension is %s" % file_extension)
            if file_extension == "pdf":
                path = os.path.join(pdf_dir, pdf)
                print(path)
                text = converter.convert('text', path)  # get string of text content of pdf
                text_file_name = os.path.join(txt_dir, pdf + ".txt")
                # utf-8 avoids encoding errors on platforms with a narrower default
                with open(text_file_name, "w", encoding="utf-8") as text_file:
                    text_file.write(text)  # write text to text file


pdf_dir = "E:/pdf"
txt_dir = "E:/text"
ConvertMultiple.convert_multiple(pdf_dir, txt_dir)

Of course, you can tune it further and there may be room for improvement, but it works.

For your case, pass a single temporary PDF file instead of a folder of PDFs.
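
To make the temp-file idea concrete, here is a minimal sketch using only the standard library. The helper name `with_temp_pdf` and its `process` callback are hypothetical, not part of the code above; the sketch assumes the downloaded bytes are available as `res.content` from the question's `requests.get()` call.

```python
import os
import tempfile


def with_temp_pdf(pdf_bytes, process):
    """Write pdf_bytes to a temporary .pdf file, call process(path), then delete the file."""
    # delete=False so the file can be reopened by name on Windows as well
    tmp = tempfile.NamedTemporaryFile(suffix=".pdf", delete=False)
    try:
        tmp.write(pdf_bytes)
        tmp.close()  # flush to disk before another reader opens the path
        return process(tmp.name)
    finally:
        tmp.close()  # harmless if already closed
        os.remove(tmp.name)


# Hypothetical usage with the question's response and the convert() function above:
# text = with_temp_pdf(res.content, lambda path: convert('text', path))
```

The file only exists while `process` runs, so nothing is left on disk afterwards.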

Hope this helps. Happy coding!

illusionx
2

PyPDF2 works fine if all you want is the text.

Install the PyPDF2 package (https://pypi.org/project/PyPDF2/) from the Anaconda terminal or the command prompt:

pip install PyPDF2

You can use the following code, which reads a PDF file and returns its text as a string:

import PyPDF2

# Legacy PyPDF2 API: these names were removed in PyPDF2 3.0 in favor of
# PdfReader, reader.pages and page.extract_text().
def getTextPDF(pdfFileName, password=''):
    with open(pdfFileName, 'rb') as pdf_file:
        read_pdf = PyPDF2.PdfFileReader(pdf_file)
        if password != '':
            read_pdf.decrypt(password)
        text = []
        for i in range(read_pdf.getNumPages()):
            text.append(read_pdf.getPage(i).extractText())
    # Joins the pages, then strips every newline from the result.
    return '\n'.join(text).replace("\n", '')


getTextPDF('0001.pdf')

Works great for me
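
One thing to watch in the code above: joining the pages with '\n' and then replacing every '\n' with '' removes all line breaks, so the last word of one line is glued to the first word of the next. A possible alternative, sketched with a hypothetical helper called `normalize_whitespace` (not part of PyPDF2), collapses every run of whitespace into a single space instead:

```python
def normalize_whitespace(pages):
    """Join page texts and collapse runs of spaces/newlines into single spaces."""
    # str.split() with no arguments splits on any whitespace and drops empties
    return " ".join(" ".join(pages).split())


pages = ["Hello\nworld", "  second   page\n"]
print("\n".join(pages).replace("\n", ""))  # newline removal glues words together
print(normalize_whitespace(pages))         # words stay separated
```

Whether you want spaces or real line breaks between lines depends on what you do with the text next.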

thrinadhn
1

If your PDF file is in AWS S3 (Simple Storage Service), pass the unsigned URL.

import boto3
from PyPDF2 import PdfFileReader
from io import BytesIO


def extract_PDF(url):  # URL where the PDF is stored online
    # Strip the site prefix to recover the S3 object key.
    CF = "https://<Bucket_name>.<Website>.com/"
    object_name = url.replace(CF, '')
    bucket_name = "<Bucket_name>.<Website>.com"

    s3 = boto3.resource('s3')
    obj = s3.Object(bucket_name, object_name)
    fs = obj.get()['Body'].read()  # raw PDF bytes
    # BytesIO lets PyPDF2 read the bytes in memory, with no file on disk.
    pdfFile = PdfFileReader(BytesIO(fs))

    text = ""
    for page_no in range(len(pdfFile.pages)):
        page = pdfFile.getPage(page_no)
        text += page.extractText()
    text = text.replace('\n', '')
    text = text.replace('  ', '')
    return text
Krooz
  • 1
    Probably more helpful to this question to drop anything regarding S3, which confuses what’s relevant, and rewrite this to request a regular URL, per the original question that uses the `requests.get()` method. – jeffbyrnes Feb 23 '20 at 16:54
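
Following the comment above, here is a minimal sketch of the same in-memory idea for a regular URL fetched with `requests.get()`, with no S3 involved. It is only a sketch under some assumptions: it uses the `pypdf` package (the successor to PyPDF2) rather than the legacy API shown in the answers, the function names `as_pdf_stream` and `pdf_text_from_url` are made up for illustration, and the 30-second timeout is an arbitrary choice.

```python
from io import BytesIO


def as_pdf_stream(content):
    """Wrap downloaded bytes in a seekable in-memory file, rejecting non-PDF bodies."""
    # The "%PDF-" signature is expected near the start of a PDF;
    # an HTML error page would not contain it.
    if b"%PDF-" not in content[:1024]:
        raise ValueError("response body does not look like a PDF")
    return BytesIO(content)


def pdf_text_from_url(url, timeout=30):
    """Download a PDF and return its text without writing anything to disk."""
    # Imported here so as_pdf_stream() stays usable without these packages.
    import requests
    from pypdf import PdfReader  # assumption: pypdf is installed

    res = requests.get(url, timeout=timeout)
    res.raise_for_status()  # fail early on 4xx/5xx responses
    reader = PdfReader(as_pdf_stream(res.content))
    # extract_text() can return an empty string for image-only pages
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```

The key step is `BytesIO(res.content)`: the reader only needs a seekable binary file object, so the response bytes never have to touch the filesystem.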