I'm trying to extract some data from a text file. It is actually a PDF, but I couldn't find a way to extract data from a PDF directly, so I first converted the .pdf to .txt. My current approach, shown below, feels a bit confusing. Is there a better way to do this, maybe a module or something?

with open("example.txt", "r") as f:
    real_price = None                        # stays None if "Price" never appears
    for line in f:
        strings = line.split()               # split on any whitespace, drops the newline
        if "Price" in strings:
            order = strings.index("Price")   # I found the index of "Price"
            if order + 1 < len(strings):
                real_price = strings[order + 1]  # then I took the value that follows it
    print(f"Price is {real_price}")
# "Price 12,90" is how it looks in the file

1 Answer

I used a regular expression to extract what you want. Check this out.

import os
import re

fname = 'example.txt'
path = './'
fpath = os.path.join(path, fname)

regex = r'[pP]rice ([\d,]+)'
# read file
with open(fpath, mode='r') as txt_file:
    for line in txt_file.readlines():
        
        # remove leading/trailing whitespace
        line = line.strip()
        result = re.search(regex, line)

        # if result is not None
        if result:
            price = result.groups()[0].strip(',')
            print(f'Price is {price}')

This is the input text file:

This is a new document
the price of this is high
The specific price 12,90
Hello. Price 20,00.
A new price 30,40, is really high

This is the output:

./extract_price.py
Price is 12,90
Price is 20,00
Price is 30,40
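
If you need the prices as numbers rather than strings, the same idea can be sketched as a small function. This is only an illustration under a few assumptions: the name extract_prices and the constant PRICE_RE are hypothetical, the pattern is slightly stricter than the one above so that a trailing comma is never captured, and the comma is assumed to be a decimal separator.

```python
import re
from decimal import Decimal

# Hypothetical pattern: "price" or "Price", one space, digits,
# optionally followed by a comma and more digits (assumed decimal comma).
PRICE_RE = re.compile(r'[pP]rice (\d+(?:,\d+)?)')


def extract_prices(text):
    """Return every price found in text as a Decimal."""
    # findall returns the captured group for each match, e.g. "12,90"
    return [Decimal(match.replace(',', '.')) for match in PRICE_RE.findall(text)]


sample = """This is a new document
the price of this is high
The specific price 12,90
Hello. Price 20,00.
A new price 30,40, is really high"""

prices = extract_prices(sample)
for price in prices:
    print(f'Price is {price}')
```

Decimal is used here instead of float so that values such as 12,90 keep their two decimal places when printed.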
KostasVlachos
  • In regex = r'[pP]rice ([\d,]+)', what does the "r" do? – Gökhan Gider Dec 24 '20 at 18:53
  • Hi, r in front of the string denotes a raw string. It is a best practice to use raw strings when you create regular expressions. See here (https://www.tutorialspoint.com/What-is-Raw-String-Notation-in-Python-regular-expression) and here (https://stackoverflow.com/a/12871105/2864686). – KostasVlachos Dec 24 '20 at 19:45
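
To make the raw-string point concrete, here is a minimal sketch that compares a normal and a raw string literal. It uses \b rather than the \d from the answer, because \b is a case where the two literals really differ: in a normal string it is the backspace character, while in a raw string the backslash is kept so the regex engine sees a word boundary. The variable names are only illustrative.

```python
import re

# Normal string literal: "\b" is a single backspace character.
normal = "\bprice\b"
# Raw string literal: the backslash is kept, so re sees \b (word boundary).
raw = r"\bprice\b"

text = "the price is 12,90"

normal_match = re.search(normal, text)  # looks for literal backspace characters
raw_match = re.search(raw, text)        # looks for the whole word "price"

print(len(normal), len(raw))  # 7 9
print(normal_match)           # None
print(raw_match.group())      # price
```

With \d the pattern happens to work without the r prefix, but recent Python versions emit a warning for the invalid escape sequence, which is one more reason to keep the prefix.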