How to extract a specific text and the integer on the same line from a PDF file using Python?

Question

Below is the data in my PDF file. Using Python, I would like to extract the integer 100 from the line "US stock price 100", with "US stock price" as the keyword.

****PDF FILE LINES BELOW*****

sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam quaerat voluptatem. 
Ut enim ad minima veniam, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea commodi consequatur? 
Quis autem vel eum iure reprehenderit qui in ea voluptate velit esse quam nihil molestiae consequatur, vel illum qui dolorem eum fugiat quo voluptas nulla pariatur
US stock price     100
"Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, 
totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo. 
Nemo enim ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit, sed quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. 
Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, 
Abb price     50

Below is the code I have used for the text extraction:

import PyPDF2

# path is the location of the PDF file (defined elsewhere)
pdfFileObject = open(path, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
count = pdfReader.numPages
# Print the extracted text of every page
for i in range(count):
    page = pdfReader.getPage(i)
    text = page.extractText()
    print(text)

Please show us what you've tried. Knowing if you're having an issue with getting data from a PDF, the regular expression, or some other part of the process will allow us to help you with the part you're stuck on. — Cohan, Nov 09 '18 at 18:36
Welcome to StackOverflow. Please read and follow the posting guidelines in the help documentation, as suggested when you created this account. [On topic](http://stackoverflow.com/help/on-topic), [how to ask](http://stackoverflow.com/help/how-to-ask), and [... the perfect question](https://codeblog.jonskeet.uk/2010/08/29/writing-the-perfect-question/) apply here. StackOverflow is not a design, coding, research, or tutorial resource. However, if you follow whatever resources you find on line, make an honest coding attempt, and run into a problem, you'd have a good example to post. — Prune, Nov 09 '18 at 18:43

score 1 · Answer 1 · answered Nov 09 '18 at 18:58

You can try using the package tika.

from tika import parser

raw = parser.from_file('test.pdf')
# from_file returns a dict; the extracted text is stored under the 'content' key
print(raw['content'])
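Once the PDF text is available as one string, the number still has to be picked out of the matching line. The following is only a minimal sketch, assuming the extracted text keeps its line breaks and that the number is the first token after the keyword. The helper name `value_after_keyword` and the `sample` string are made up for illustration; they stand in for the text a parser returns.

```python
def value_after_keyword(text, keyword):
    """Return the integer following keyword on the first line that starts with it, or None."""
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith(keyword):
            # Everything after the keyword, split on any run of whitespace
            rest = stripped[len(keyword):].split()
            if rest and rest[0].isdecimal():
                return int(rest[0])
    return None


# Hypothetical stand-in for the text extracted from the PDF
sample = "lorem ipsum\nUS stock price     100\ndolor sit amet\nAbb price     50\n"
print(value_after_keyword(sample, "US stock price"))  # prints 100
```

If the extracted text does not preserve line breaks, this line-based approach will not find the value, and a pattern search over the whole string is likely the better fit.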

Mayank Porwal

33,470
8
37
58

score 0 · Answer 2 · answered Nov 09 '18 at 18:56

Below is code to search for the keyword in a PDF file.

import PyPDF2
import re

# Named reader rather than object, to avoid shadowing the built-in
reader = PyPDF2.PdfFileReader("test.pdf")
numPages = reader.getNumPages()
string = "US stock price"
for i in range(0, numPages):
    pageObj = reader.getPage(i)
    print("this is page " + str(i))
    txt = pageObj.extractText()
    # re.search returns a match object for the keyword, or None if it is absent
    resSearch = re.search(string, txt)
    print(resSearch)

Venkat

I am unable to extract the integer 100 by giving string = "US stock price". – Chandra Sekhar, Nov 09 '18 at 19:08
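Searching for the keyword alone only reports where it occurs; to get the number, the pattern needs a capture group for the digits that follow. Here is a hedged sketch of that idea, not a tested fix for any particular PDF. The function name `find_int_after` and the `txt` sample are hypothetical, and the pattern assumes the integer comes right after the keyword, separated by optional whitespace.

```python
import re


def find_int_after(keyword, text):
    """Return the first integer that directly follows keyword in text, or None."""
    # Allow any whitespace (spaces, tabs, newlines) between the keyword's words,
    # since PDF extraction may not reproduce the original spacing exactly.
    words = r"\s+".join(re.escape(word) for word in keyword.split())
    match = re.search(words + r"\s*(\d+)", text)
    return int(match.group(1)) if match else None


# Hypothetical stand-in for the text returned by extractText()
txt = "fugiat quo voluptas nulla pariatur\nUS stock price     100\nSed ut perspiciatis"
print(find_int_after("US stock price", txt))  # prints 100
```

Because the search runs over the whole string, it does not depend on the extracted text keeping its line breaks. Inside the page loop, the function would be called on each page's text in place of the plain `re.search(string, txt)`.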
