I am having a problem with a PDF file. I want to extract all of the text from it, but after extraction all I have is a raw bytes object.

Below is part of the output:

b'%PDF-1.7\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n<</Type/Catalog/Pages 2 0 R/Lang(en-US) /Metadata 89 0 R/ViewerPreferences 90 0 R>>\r\nendobj\r\n2 0 obj\r\n<</Type/Pages/Count 11/Kids[ 3 0 R 28 0 R 36 0 R 38 0 R 42 0 R 49 0 R 58 0 R 60 0 R 62 0 R 64 0 R 66 0 R] >>\r\nendobj\r\n3 0 obj\r\n<</Type/Page/Parent 2 0 R/Resources<</Font<</F1 5 0 R/F2 9 0 R/F3 12 0 R/F4 17 0 R/F5 19 0 R>>/ExtGState<</GS7 7 0 R/GS8 8 0 R>>/XObject<</Image27 27 0 R>>/ProcSet[/PDF/Text/ImageB/ImageC/ImageI] >>/Annots[ 11 0 R 24 0 R 25 0 R 26 0 R] /MediaBox[ 0 0 612 792] /Contents 4 0 R/Group<</Type/Group/S/Transparency/CS/DeviceRGB>>/Tabs/S>>\r\nendobj\r\n4 0 obj\r\n<</Filter/FlateDecode/Length 5962>>\r\nstream\r\nx\x9c\xc5][o\xe3\xc6\x92~\x1f`\xfeC?J\x81\x87!\xbby\x1d\x1c,0\x17\'9\x07\xc9\\l\x03\xd9 \xc9\x03-\xd1\x16weI!9\xe3\xf1\xbf\xdf\xfa\xaa\x9b\x17\x89\xa4\xec\x91Z\xde\x01\xac\x91\xa8&\xab\xba\xaa\xba\xee\xdd\xfa\xe7\xe5\x0b\xd7q\xf1/\xf1\xa4pEH\xafQ"E\x91\xbd|\xf1\xfb\x0fb\xf5\xf2\xc5\xdb\xab\x97/~\xfc\xc9\x13\x9e\xe7\xb8\xbe\xb8\xbay\xf9\xc2\xa3q\xae\xf0\x84\x1f\x06\x8e\xa4\xe1A\xe2$\xa1\xb8\xba\xa3q?_F\xe2\xb6\xa4g\x8a[\xfe\x14\x9bO?\xbf|\xf1\xe7\xe4\xd7\xe9+5I\xcbJ\xe0\xff/S5\xd9\xd0\xdf\x9c\xfe\xd2j\xea\xb9\x93l\xfeZL\xff\x16W\xffy\xf9\xe2\x9c`~~\xf9\xe2\x9f#\x90\x0bd\xec\x04q\x179\xc6\xc9\xa0\xa2\x80\xc2\x8f\xd3P\xbfq\xa7\x11}x\xe5O$\xbd\xc1\x07\x0fWc\x8b\xc8D\xa1\xe3\xc91d\xbe{\xd6z\x90r\x9d\xd8\x17a(\x9d\xc8\x17^\xec9I$\x12\xfa@\x17\xdb\xa1O\x1d\xa7q\x97\x82`u\x11W\xa1\x88|\x1f\xb8?\x8e\xf4\xe7\xfa\x8d\xf4\x94#\x93\x1a\xa2\nb\xc7U\x83\x98=m`\x83Z\xc0\xc4\xeb`\'\xbd\xd8\xf1\x03\xc2\xd0ud\xdc\xc3\xf0\xb7\xacJ\xb5t\xa5\xd3Wr2

The code for this is as follows:

import requests

url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
response = requests.get(url, stream=True)
data = response.content  # raw bytes of the downloaded PDF file

print(data)

How can I extract the text from this?

Martin Evans
  • You say 'after extraction' but as far as I can see you've just downloaded the file and not actually tried to extract the text. Maybe this question? [Extracting text from a PDF file using Python](https://stackoverflow.com/q/34837707) – Rup Jun 06 '18 at 15:50
  • Welcome to StackOverflow! If your question is **not** a duplicate of that, please edit your question and make it clear what you're looking for. Meanwhile, please take the [tour](https://stackoverflow.com/tour) and read [How do I ask a good question?](https://stackoverflow.com/help/how-to-ask) Your best bet here is to do your research, [search](https://stackoverflow.com/help/searching) for related topics on SO, and give it a go. Good luck! – Jeff Learman Jun 06 '18 at 16:03
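
To illustrate the point made in the comments: the printed bytes are the PDF file itself, and the page content sits in compressed streams (the dump shows `/Filter/FlateDecode` followed by `stream\r\nx\x9c...`, where `x\x9c` is a zlib header). The following is only a rough sketch, using nothing but the standard library, of how such a stream can be inflated. The helper name `inflate_streams` is made up for illustration, and the naive regex scan is an assumption that will not cope with every PDF (encrypted files, other filters, object streams). Even when it works, the result is PDF drawing operators rather than plain text, which is why a real PDF parser is needed.

```python
import re
import zlib

# Bytes between a "stream" keyword (not the tail of "endstream") and the
# next "endstream". This ignores /Length and the xref table on purpose.
STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)\r?\nendstream", re.DOTALL)


def inflate_streams(pdf_bytes):
    """Return the inflated bodies of every stream that zlib can decompress."""
    results = []
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            results.append(zlib.decompress(match.group(1)))
        except zlib.error:
            # Not Flate-compressed, or compressed with another filter: skip.
            continue
    return results
```

Calling `inflate_streams(data)` on the downloaded bytes would, under those assumptions, return a list of byte strings holding the page content operators and other stream data, not the readable text of the document.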

1 Answer

What you have printed is the raw content of the PDF file itself, not its text. You need to use a package to parse the PDF file and extract the text from it. For example, PyPDF2 could be used as follows:

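import io
import requests
import PyPDF2

# Note: the camelCase names below are the PyPDF2 1.x/2.x API;
# PyPDF2 3.0 removed them in favour of PdfReader, pages and extract_text().
url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
response = requests.get(url, stream=True)
# Wrap the downloaded bytes in a file-like object so PyPDF2 can read them
pdf = PyPDF2.PdfFileReader(io.BytesIO(response.content))

with open('output.txt', 'w') as f_output:
    # Write the extracted text of each page in turn
    for page in range(pdf.getNumPages()):
        f_output.write(pdf.getPage(page).extractText())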

This would create an output.txt file starting:

Last updated: 
3/30/2018


Metadata: 
Tivoli Bay 
South

Hydrologic

Station

Location: 
Tivoli Bay
, NY
(
42.027038, 
-
73.925957
)

Data collection period:

July

1996*
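
On current releases the same approach needs the newer names, since PyPDF2 3.0 dropped `PdfFileReader`, `getNumPages`, `getPage` and `extractText`, and the project now continues as `pypdf`. A minimal sketch, assuming `pypdf` is installed; `join_page_text` and `pdf_bytes_to_text` are hypothetical helper names chosen here, not part of the library:

```python
import io


def join_page_text(pages, separator="\n"):
    """Concatenate the text of page objects that expose extract_text()."""
    return separator.join((page.extract_text() or "") for page in pages)


def pdf_bytes_to_text(pdf_bytes):
    """Parse in-memory PDF bytes and return the text of all pages."""
    # Third-party dependency (assumed installed); imported lazily so that
    # join_page_text stays usable without it.
    from pypdf import PdfReader

    reader = PdfReader(io.BytesIO(pdf_bytes))
    return join_page_text(reader.pages)


# Usage (commented out: needs network access plus requests and pypdf):
# import requests
# url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
# response = requests.get(url)
# response.raise_for_status()
# with open('output.txt', 'w', encoding='utf-8') as f_output:
#     f_output.write(pdf_bytes_to_text(response.content))
```

The extracted text may still be broken into short fragments like the sample above, since that depends on how the PDF lays out its text.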
Martin Evans