Extracting text from PDF using Python and PyPDF2

Question

I want to extract text from a PDF file using Python and the PyPDF2 package. This is my PDF file and this is my code:

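import PyPDF2

# The second positional argument of PdfFileReader is `strict`, not a file mode,
# so the file is opened in binary mode separately.
opened_pdf = PyPDF2.PdfFileReader(open('test.pdf', 'rb'))

p = opened_pdf.getPage(0)
p_text = p.extractText()

# extract data line by line
P_lines = p_text.splitlines()
print(P_lines)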

My problem is that P_lines is not split line by line; the result is a list holding one giant string. I want to extract the text line by line so I can analyze it. Any suggestions on how to improve this? Thanks! This is what the code returns:

[u'Ingredient information for chemicals subject to 29 CFR 1910.1200(i) and Appendix D are obtained from suppliers Material Safety Data Sheets (MSDS)** Information is based on the maximum potential for concentration and thus the total may be over 100%* Total Water Volume sources may include fresh water, produced water, and/or recycled water0.01271%72.00%7732-18-5Water0.00071%4.00%1310-73-2Sodium Hydroxide0.00424%24.00%533-74-4DazomatBiocidePumpcoPlexcide 24L0.00828%75.00%Organic phosphonic acid salts0.00276%25.00%67-56-1Methyl AlcoholScale InhibitorPumpcoPlexaid 6730.00807%30.00%7732-18-5Water0.00188%7.00%Polyethoxylated alcohol surfactants0.00753%28.00%9003-06-9Ammonium Salts0.00941%35.00%64742-47-8Petroleum DistillateFriction ReducerPumpcoPlexslick 9210.05029%60.00%7732-18-5Water0.03353%40.00%7647-01-0Hydrogen ChlorideHydrochloric AcidPumpcoHCL9.84261%100.00%14808-60-7Crystaline SilicaProppantPumpcoSand90.01799%100.00%7732-18-5WaterCommentsMaximumIngredientConcentrationin HF Fluid(% by mass)**MaximumIngredientConcentrationin Additive(% by mass)**Chemical AbstractService Number(CAS #)IngredientsPurposeSupplierTrade NameHydraulic Fracturing Fluid Composition:2,608,032Total Water Volume (gal)*:7,595True Vertical Depth (TVD):GasProduction Type:NAD27Long/Lat Projection:32.558525Latitude:-97.215242Longitude:Ole Gieser Unit D 6HWell Name and Number:XTO EnergyOperator Name:42-439-35084API Number:TarrantCounty:TexasState:12/10/2010Fracture DateHydraulic Fracturing Fluid Product Component Information Disclosure']


this is what it returns: [u'Ingredient information for chemicals subject to 29 CFR 1910.1200(i) and Appendix D are obtained from suppliers Material Safety Data Sheets (MSDS)** Information is based on the maximum potential for concentration and thus the total may be over 100%* Total Water Volume sources may include fresh water, produced water, and/or recycled water0.01271%72.00%7732-18-5Water0.00071%4.00%1310-73-2Sodium Hydroxide0.00424%24.00%533-74-4DazomatBiocidePumpcoPlexcide 24L0.00828%.... — Amir, Mar 12 '17 at 02:35
add that string to the question its not very clear in the comment, also can you indicate where in the string you would expect the newline to occur — parsethis, Mar 12 '17 at 02:36
no it doesn't! I think since this pdf is generated from Excel, PYPDF2 has issues reading it and extracting text line by line — Amir, Mar 12 '17 at 02:45
Then that's the problem. `string.splitlines()` splits a string when it finds a newline. — AAM111, Mar 12 '17 at 02:47
ok so what do you suggest to use instead of splitlines()? thanks — Amir, Mar 12 '17 at 02:48
I added screenshot of pdf file. Also pdf file is attached to SKYDRIVE — Amir, Mar 12 '17 at 02:52
@Amir see my answer, since it has been two years, wondering if this solution works for you. Thanks — james-see, Mar 27 '19 at 13:38

Smart Manoj · Answer 1 · 2018-12-01T01:11:27.877

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()  # in-memory buffer that collects the extracted text
    codec = 'utf-8'
    laparams = LAParams()  # default layout analysis parameters
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0  # 0 means no page limit
    caching = True
    pagenos = set()  # an empty set means all pages

    # open() replaces file(), which only exists in Python 2
    with open(path, 'rb') as fp:
        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
            interpreter.process_page(page)

    text = retstr.getvalue()

    device.close()
    retstr.close()
    return text

# Blocks of text are separated by blank lines in the converter output
print(convert_pdf_to_txt('test.pdf').strip().split('\n\n'))

Output

Hydraulic Fracturing Fluid Product Component Information Disclosure

Fracture Date State: County: API Number: Operator Name: Well Name and Number: Longitude: Latitude: Long/Lat Projection: Production Type: True Vertical Depth (TVD): Total Water Volume (gal)*:

12/10/2010 Texas Tarrant 42-439-35084 XTO Energy Ole Gieser Unit D 6H -97.215242 32.558525 NAD27 Gas 7,595 2,608,032

Hydraulic Fracturing Fluid Composition:

Trade Name

Supplier

Purpose

Ingredients

Chemical Abstract Service Number

(CAS #)

Maximum Ingredient

Concentration

in Additive ( by mass)**

Comments

Maximum Ingredient

Concentration

in HF Fluid ( by mass)**

Water Sand HCL

Pumpco Pumpco

Proppant Hydrochloric Acid

Plexslick 921

Pumpco

Friction Reducer

Plexaid 673

Pumpco

Scale Inhibitor

Plexcide 24L

Pumpco

Biocide

Crystaline Silica

Hydrogen Chloride Water

Petroleum Distillate Ammonium Salts Polyethoxylated alcohol surfactants Water

Methyl Alcohol Organic phosphonic acid salts

Dazomat Sodium Hydroxide Water

7732-18-5 14808-60-7

7647-01-0 7732-18-5

64742-47-8 9003-06-9

7732-18-5

67-56-1

533-74-4 1310-73-2 7732-18-5

100.00 100.00

90.01799 9.84261

40.00 60.00

35.00 28.00 7.00 30.00

25.00 75.00

24.00 4.00 72.00

0.03353 0.05029

0.00941 0.00753 0.00188 0.00807

0.00276 0.00828

0.00424 0.00071 0.01271

Total Water Volume sources may include fresh water, produced water, and/or recycled water ** Information is based on the maximum potential for concentration and thus the total may be over 100

Ingredient information for chemicals subject to 29 CFR 1910.1200(i) and Appendix D are obtained from suppliers Material Safety Data Sheets (MSDS)

I get the following error when I run your code: UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128) — Amir, Mar 12 '17 at 03:18
No, unfortunately I didn't manage to do it; I get the same error listed above. However, even if the code works, it is not going to be useful because it is not parsing line by line. Thanks for your effort — Amir, Mar 14 '17 at 02:31
http://stackoverflow.com/questions/9942594/unicodeencodeerror-ascii-codec-cant-encode-character-u-xa0-in-position-20 The character \xe9 is é — Smart Manoj, Mar 14 '17 at 13:00
for python 3 and above one have to use `io` instead of `cStringIO` — WiLL_K, Oct 01 '18 at 13:29
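
The UnicodeEncodeError quoted in the comments above can be reproduced without any PDF. This is a minimal sketch in Python 3, assuming the failure comes from encoding the extracted text with the ASCII codec (in Python 2 that happens implicitly when printing or writing unicode text); the variable names are illustrative only.

```python
# The character reported in the error message: u'\xe9' is "é".
text = "\xe9"

# Encoding with the ASCII codec fails, because é is outside the 0-127 range.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding with UTF-8 succeeds and produces a two-byte sequence.
utf8_bytes = text.encode("utf-8")

print(ascii_failed, utf8_bytes)
```

Encoding explicitly with UTF-8 before printing or writing to a file is the usual way around this error.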

james-see · Answer 2 · 2020-05-31T12:52:34.660


textract works fine in Python 3, using the tesseract method. Example code:

import textract

# textract.process returns bytes, so decode before printing or writing
text = textract.process("pdfs/testpdf1.pdf", method='tesseract').decode('utf-8')
print(text)
with open('textract-results.txt', 'w', encoding='utf-8') as f:
    f.write(text)

https://pypi.org/project/textract/
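
One detail worth noting when saving the result: textract.process returns bytes in Python 3, so calling str() on it yields the bytes repr rather than the text. The sketch below shows the difference with a hypothetical stand-in value, so it runs without textract installed.

```python
# Hypothetical stand-in for the bytes that textract.process() returns.
raw = b"Water\nSodium Hydroxide"

# str() on bytes gives the repr, including the b'' wrapper and an escaped \n.
as_repr = str(raw)

# decode() gives real text with a real newline, which splitlines() can use.
as_text = raw.decode("utf-8")

print(as_repr)
print(as_text.splitlines())
```

Decoding first means the saved file contains the extracted text itself, and the lines can then be split normally.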

edited May 31 '20 at 12:52

answered Mar 26 '19 at 02:25


I think textract is not the right answer here. It's more like an enterprise solution. You can't access it without a credit card. Generally, people here are looking for open source solutions or some coding based solution whereas this is more like a blackbox solution which is also paid. So I think it would have been better if there were a disclaimer before the solution, something like "this might not be the right solution for everyone as it's more like an enterprise solution provided by aws". This answer wasted my time instead of saving it. So downvoted. – penduDev May 30 '20 at 16:49
@penduDev please see my update to my answer. At the time of my answer Amazon did not have a thing called Textract which is unfortunately the same name as an open source library in pypi. Please download and use this one: https://pypi.org/project/textract/ as it really does work that easy out of the box. And please remove the downvote, since your assumption about AWS / enterprise solution was wrong. I very much seek and use open source solutions and share them here and everywhere I can. – james-see Jun 04 '20 at 03:03
oh.. i misunderstood! thanks for clarification @jamescampbell ..My problem was solved another way, but hope someone else find this helpful. Thank You! :) – penduDev Jun 05 '20 at 07:35

score 0 · Answer 3 · answered Mar 12 '17 at 02:56


Make sure that the PDF you are importing actually has newlines in it. If it doesn't, then there is nowhere for p_text.splitlines() to split the string! If there is a specific character, you can use p_text.split("the linebreak character").
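
A quick way to see this behaviour, using two made-up strings rather than the actual PDF text:

```python
# Hypothetical stand-ins for extracted page text; not taken from the PDF above.
no_newlines = "Water0.00071%4.00%1310-73-2Sodium Hydroxide"
with_newlines = "Water\nSodium Hydroxide\nDazomat"

# With no newline characters, splitlines() returns the whole string as one item.
single = no_newlines.splitlines()

# With newline characters, each line becomes its own list item.
several = with_newlines.splitlines()

print(len(single), len(several))
```

If the first case matches what you see, the extractor never emitted line breaks, and no string method will recover them.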

EDIT: Based on your PDF, I'm not sure there is a way to split this by line, since the page seems to be laid out by position rather than linearly (text is placed at coordinates in the PDF, not written line by line).
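
If an extraction library can give you each text fragment together with its coordinates, lines can be rebuilt by grouping fragments that share roughly the same vertical position. The sketch below is only an illustration of that idea: the (x, y, text) tuples, the group_into_lines helper and the y_tolerance value are all hypothetical, and how to obtain the coordinates from a particular library is not shown here.

```python
from collections import defaultdict

def group_into_lines(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into lines of text, top to bottom."""
    rows = defaultdict(list)
    for x, y, text in fragments:
        # Snap y to a bucket so fragments on roughly the same baseline share a key.
        key = round(y / y_tolerance)
        rows[key].append((x, text))

    lines = []
    # PDF y-coordinates grow upward, so sort descending to read top to bottom.
    for key in sorted(rows, reverse=True):
        # Within a row, order fragments left to right by x.
        ordered = sorted(rows[key])
        lines.append(" ".join(text for _, text in ordered))
    return lines

# Made-up fragments, deliberately out of reading order.
fragments = [
    (300.0, 700.0, "Pumpco"),
    (50.0, 700.4, "Sand"),
    (50.0, 680.0, "HCL"),
    (300.0, 680.2, "Pumpco"),
]
lines = group_into_lines(fragments)
print(lines)
```

Bucketing by a fixed tolerance is crude: two fragments on the same visual line can land in neighbouring buckets if they straddle a bucket boundary, so the tolerance would need tuning for a real document.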


AAM111


Yes splitlines() doesn't work. Is there any other pdf extraction package than can accomplish this? – Amir Mar 12 '17 at 03:00
@Amir see my answer. – james-see May 26 '20 at 15:05

score 0 · Answer 4 · edited May 07 '21 at 06:23

Here's the function I came up with. It is based entirely on @SmartManoj's answer, but I think it is cleaner: it uses with statements, drops unnecessary variables (the ones whose keyword arguments are self-explanatory), and yields each page's text.

from typing import Generator  
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

def pages_as_txt(path) -> Generator[str, None, None]:
    rsrcmgr = PDFResourceManager()
    with StringIO() as retstr, TextConverter(rsrcmgr, retstr, codec='utf-8', laparams=LAParams()) as device:
        interpreter = PDFPageInterpreter(rsrcmgr, device)

        with open(path, 'rb') as fp:
            for page in PDFPage.get_pages(fp, check_extractable=False):
                interpreter.process_page(page)
                yield retstr.getvalue()
                retstr.truncate(0)
                retstr.seek(0)
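
The buffer reuse at the end of the loop above (getvalue, then truncate(0), then seek(0)) can be seen in isolation with plain strings. This is a small sketch of that pattern only; chunks_via_buffer and its input are made up for illustration and do not involve pdfminer.

```python
from io import StringIO

def chunks_via_buffer(pieces):
    """Yield each piece separately while reusing a single StringIO buffer."""
    buffer = StringIO()
    for piece in pieces:
        buffer.write(piece)
        yield buffer.getvalue()
        # truncate(0) empties the buffer but leaves the stream position unchanged,
        # so seek(0) is needed to start the next write at the beginning.
        buffer.truncate(0)
        buffer.seek(0)

pages = list(chunks_via_buffer(["page one", "page two"]))
print(pages)
```

Without the reset, each yielded value would also contain the text of all earlier pages.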
