0

I am extracting data from a PDF:

The string "Error:" is on line n = 4, but I need to extract the value from line n + 2 (the value 247156909 xxxx).

4 Error:
5 XZXZXZXZXZXZX
6 247156909 xxxx 

import re
import pdfplumber

with pdfplumber.open(file) as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        for i, line in enumerate(text.split('\n')):
            print(i, line)
            if re.match(r"Error\s*:", line):
                tot = line.split()  # how can I get the line at position i+2?
mkrieger1
user1862965

3 Answers

2

The approaches based on .split('\n') do not scale to big files (or unbounded streams), because they load everything into memory.

A more robust way is to lazily pair each item with the one that comes `offset` positions later:

import itertools

def pairwise_with_offset(iterable, offset: int):
    "s -> (s0, s[offset]), (s1, s[offset+1]), (s2, s[offset+2]), ..."
    a, b = itertools.tee(iterable)
    # Advance the second iterator by `offset` items
    for _ in range(offset):
        next(b, None)
    return zip(a, b)

You can find more information here: https://stackoverflow.com/a/5434936/8933502

It is worth getting used to this approach even if your PDF library is not optimized for it: you will likely reuse the same pattern again and again, and in the future the input may be a file-like object (or any other iterable).
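
As a rough sketch of how this could be applied to the question, here is the helper used with an offset of 2. The name `values_after_error` and the `sample` list are made up for illustration (the sample mirrors the three lines shown in the question); with pdfplumber you would presumably pass the lines of `page.extract_text()` instead.

```python
import itertools
import re

def pairwise_with_offset(iterable, offset: int):
    """Pair each item with the item `offset` positions after it."""
    a, b = itertools.tee(iterable)
    # Advance the second iterator by `offset` items
    for _ in range(offset):
        next(b, None)
    return zip(a, b)

def values_after_error(lines, offset=2):
    """Yield the line `offset` positions after each line starting with 'Error:'.

    Hypothetical helper, not part of any library.
    """
    for line, later in pairwise_with_offset(lines, offset):
        if re.match(r"Error\s*:", line):
            yield later

# Sample data copied from the question
sample = ["Error:", "XZXZXZXZXZXZX", "247156909 xxxx"]
found = list(values_after_error(sample))
print(found)
```

Note that `zip` stops as soon as the shifted iterator is exhausted, so an "Error:" line that has fewer than `offset` lines after it is silently skipped.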

Samuel GIFFARD
1

When you find the line containing Error, you know that the line number containing the value is the current line number i plus 2.

So store that line number in a variable and, while iterating, check whether the current line number equals it. If it does, you have found the value:

value_line = None  # initialize with a value that is not a valid line number

for i, line in enumerate(text.split('\n')):
    if re.match(r"Error\s*:", line):
        value_line = i + 2
    if i == value_line:  # this will happen in a later iteration
        print(line)      # this is the line containing the value

Alternatively, collect all lines in a list beforehand. Then you can directly access the desired line from the list and do not need to keep iterating:

lines = text.split('\n')

for i, line in enumerate(lines):
    if re.match(r"Error\s*:", line):
        print(lines[i + 2])
        break  # found the value, can stop iterating

Of course, instead of printing the line containing the value, you can do something else with it, for example split it and convert the first item to an integer.
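
For instance, that last step might look like the sketch below. The function name `value_after_error` and the `sample_text` string are only illustrative, and it assumes the value line is non-empty and starts with an integer, as in the question's sample.

```python
import re

def value_after_error(text, offset=2):
    """Return the first token (as an int) of the line `offset` lines after
    the first 'Error:' line, or None if there is no such line.

    Hypothetical helper for illustration only.
    """
    lines = text.split('\n')
    for i, line in enumerate(lines):
        # Guard against an 'Error:' line too close to the end of the text
        if re.match(r"Error\s*:", line) and i + offset < len(lines):
            return int(lines[i + offset].split()[0])
    return None

# Sample text mirroring the three lines shown in the question
sample_text = "Error:\nXZXZXZXZXZXZX\n247156909 xxxx "
result = value_after_error(sample_text)
print(result)
```

The bounds check avoids the IndexError that `lines[i + 2]` would raise if "Error:" appeared on one of the last two lines.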

mkrieger1
1

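Since "Lines" is a list, you can iterate over it, check whether an item contains "Error", and from that point take the item at index count + 1. Because count has already been incremented at that point, this is the line two positions after the match.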

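# Read all lines into a list using readlines()
with open('file.txt', 'r') as file1:
    Lines = file1.readlines()

count = 0
for line in Lines:
    count += 1  # count is now the index of the next line
    if "Error" in line:
        # Lines[count + 1] is two lines after the matching line;
        # strip() removes the trailing newline character
        print(Lines[count + 1].strip())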
hamdi
  • Counters for loops in python should be avoided. OP was correctly using `enumerate`, which automatically creates the counter. – Samuel GIFFARD Feb 23 '21 at 11:24