Extraction of specific text from PDF using Python

Question

Is it possible to extract specific text from a PDF using Python?

Test case: I have a PDF file of more than 10 pages, and I need to extract specific labels and the values associated with them. Example: user:value, user id:value. These values need to be extracted.

I was able to read all the pages; now I want to pull out only the specific text.

Does this answer your question? [How to extract text from pdf in python 3.7.3](https://stackoverflow.com/questions/55767511/how-to-extract-text-from-pdf-in-python-3-7-3) — avocadoLambda, May 10 '20 at 09:58
As a new user, please also take the [tour] and read [ask]. In particular, questions that can be answered with yes or no are usually bad questions. — Ulrich Eckhardt, May 10 '20 at 10:33
You may transform PDF to XML or to json and then use a lib-xml library or json library in order to extract whatever you want from it. — Catalina Chircu, May 10 '20 at 10:43

score 0 · Answer 1 · answered May 10 '20 at 10:35

If you are already able to read the PDF and store the text into a string, you could do the following:

import re  # regular expression module from the standard library

# Sample text standing in for the string extracted from the PDF
pdf_text = """
user: John
user: Doe
user id: 2
user id: 4
"""

# re.findall returns a list of all substrings matching the pattern
results = re.findall(r'user:\s\w+', pdf_text)
print(results)  # ['user: John', 'user: Doe']

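In words: find every match that starts with the string 'user:', followed by one whitespace character '\s', followed by one or more word characters '\w+' (letters, digits, and underscore). The '+' keeps matching until the next character is no longer a word character. If the text has no space after the colon, use '\s*' (zero or more whitespace characters) instead of '\s'.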
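To see how the whitespace part of the pattern behaves, here is a small sketch. The two sample strings are made up for illustration; it compares '\s' (exactly one whitespace character) with '\s*' (zero or more), which matters when the PDF text has no space after the colon.

```python
import re

# Hypothetical sample strings, with and without a space after the colon
with_space = "user: John"
no_space = "user:John"

# \s requires exactly one whitespace character after the colon
strict = re.findall(r'user:\s\w+', no_space)

# \s* allows zero or more whitespace characters, so both forms match
relaxed_no_space = re.findall(r'user:\s*\w+', no_space)
relaxed_with_space = re.findall(r'user:\s*\w+', with_space)

print(strict)              # []
print(relaxed_no_space)    # ['user:John']
print(relaxed_with_space)  # ['user: John']
```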

If you only want the "value" field back, use r'user:\s(\w+)'. The parentheses form a capturing group around '\w+', and when a pattern contains a group, findall returns a list of the group matches instead of the full matches, so the result would be:

results = re.findall(r'user:\s(\w+)', pdf_text)
print(results)  # ['John', 'Doe']
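Since the question mentions more than one label (user, user id), a pattern with two groups can collect label/value pairs in a single pass. The following is only a sketch, assuming each pair sits on its own line in the form label:value; the sample text and the names labels, pairs, and by_label are illustrative, not part of the original answer.

```python
import re
from collections import defaultdict

# Made-up sample standing in for the text extracted from the PDF
pdf_text = """
Report header
user: John
user id: 2
Some unrelated line
user:Doe
user id:4
"""

# Labels of interest; "user id" is listed first so it is tried before "user"
labels = ["user id", "user"]
label_pattern = "|".join(re.escape(label) for label in labels)

# With re.MULTILINE, ^ and $ match at the start and end of each line
pattern = re.compile(rf'^({label_pattern}):[ \t]*(.+?)[ \t]*$', re.MULTILINE)

# With two groups, findall returns a list of (label, value) tuples
pairs = pattern.findall(pdf_text)
print(pairs)
# [('user', 'John'), ('user id', '2'), ('user', 'Doe'), ('user id', '4')]

# Collect the values per label
by_label = defaultdict(list)
for label, value in pairs:
    by_label[label].append(value)

print(dict(by_label))
# {'user': ['John', 'Doe'], 'user id': ['2', '4']}
```

If the values in the real PDF can span several lines or contain colons themselves, the pattern would need to be adjusted accordingly.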

Take a look at the regex module documentation at: https://docs.python.org/3/library/re.html

Other functions such as finditer(), which yields match objects that carry the groups and the position of each match, can help if you need more complex extraction.
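As one illustration of finditer(), the sketch below uses a named group and records the line on which each match was found. The group name value, the variable names, and the sample text are assumptions made for the example.

```python
import re

# Made-up sample standing in for the text extracted from the PDF
pdf_text = "user: John\nuser id: 2\nuser: Doe\nuser id: 4\n"

# (?P<value>...) is a named group; \d+ matches one or more digits
pattern = re.compile(r'user id:\s*(?P<value>\d+)')

found = []
for match in pattern.finditer(pdf_text):
    # Line number = count of newlines before the match start, plus one
    line_no = pdf_text.count("\n", 0, match.start()) + 1
    found.append((line_no, match.group("value")))

print(found)  # [(2, '2'), (4, '4')]
```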

This regex guide could also be of help: https://www.regexbuddy.com/regex.html?wlr=1
