How to find N lines containing a specific string, searching backwards from an offset?

Question

How can I find N lines containing a specific string, searching backwards from a given offset? With Python on Unix.

Given a file:

a
babc1
c
abc1
abc2
d
e
f

Given the offset: 20 (that's "d"), the string: "abc", N: 2, the output should be:

strings:
# the babc1 will not count since we only need 2
abc1
abc2

offset: (we need to return the offset where the search ends)
10 (the "a" in "abc1")

The above example is just a demo; the real file is a 33G log, which is why I need to take the offset as input and return it as output.
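The offsets used in the demo can be double-checked with a short sketch like the one below. It assumes the sample file uses "\n" line endings and ends with a trailing newline; `line_offsets` is a name made up for illustration.

```python
# Sample file from the question, assuming "\n" line endings and a trailing newline.
data = b"a\nbabc1\nc\nabc1\nabc2\nd\ne\nf\n"


def line_offsets(buf):
    """Return (start_offset, line) pairs for every line in buf."""
    pairs = []
    pos = 0
    # split() leaves an empty piece after the final "\n"; drop it.
    for line in buf.split(b"\n")[:-1]:
        pairs.append((pos, line))
        pos += len(line) + 1  # +1 for the newline byte
    return pairs


for start, line in line_offsets(data):
    print(start, line)
```

Under these assumptions "d" starts at offset 20 and "abc1" starts at offset 10, which matches the input and output offsets above.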

I think the core problem is: how do I read lines backwards from a file, starting at a given offset? The offset is near the end of the file.

I tried to do it with bash, and it was agony. Is there an elegant, efficient way to do it in Python 2? Also, we will run the script with suitable (a wrapper around Ansible), so the dependencies should be as minimal as possible.

rassar · Answer 1 · 2019-10-30T12:39:00.267


You can use the following function:

from file_read_backwards import FileReadBackwards

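def search(filename, file_size, offset, substring, n):
    off = 0  # characters consumed so far, counted from the end of the file
    with FileReadBackwards(filename) as f:
        # Skip the lines that lie after the given offset.
        while off < (file_size - offset):
            line = f.readline()
            off += len(line)
        found = 0
        # Keep reading backwards, yielding each line that contains the substring.
        for line in f:
            off += len(line)
            if substring in line:
                yield line
                found += 1
            if found >= n:
                # The final value yielded is the offset where the search stopped.
                yield file_size - off - 1
                return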

Use it like this:

s = "s.txt"
file_size = 25
offset = 20
string = "abc"
n = 2

*found, offset = search(s, file_size, offset, string, n)
print(found, offset)

Prints:

['abc2', 'abc1'] 10

edited Oct 30 '19 at 12:39

answered Oct 30 '19 at 12:10


Sorry, I forgot to mention that the file is a 33G log, so I can't read the whole thing into a string and reverse it. – YNX Oct 30 '19 at 12:12
I added support for reading from a file - one byte at a time until offset is reached. – rassar Oct 30 '19 at 12:18
Sorry, I think I left out some details. The offset is very large; think of it as being near the tail. For a 33G file, the offset is usually around 32.99G or 32.98G, so your solution is not efficient in this case. – YNX Oct 30 '19 at 12:23
Unfortunately any solution is going to be inefficient without either knowing a) the size of the file or b) the offset represented as a tail because there's no way of knowing how much of a file to read. To read a file backwards you can look at [file-read-backwards](https://pypi.org/project/file-read-backwards/) – rassar Oct 30 '19 at 12:27
The file size and the offset are both known. – YNX Oct 30 '19 at 12:31
I updated it if you know the file size - should be much more efficient. – rassar Oct 30 '19 at 12:40

Holy Mackerel · Answer 2 · 2019-10-30T12:56:24.550

You can use seek to go to the offset in the file as follows:

def reverse_find(string, offset, count):

    with open("FILENAME") as f:
        f.seek(offset)
        results = []

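        while offset > 1 and count > 0:
            line = ""
            char = ""

            # Step backwards one character at a time until the previous newline.
            while char != "\n":
                offset -= 1
                f.seek(offset)
                char = f.read(1)
                line = char + line

            # Prepend matches so the results stay in file order.
            if string in line:
                results = [line.strip()] + results
                count -= 1

        # offset is at the newline before the last line read, so that line starts one byte later.
        return results, offset + 1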

print(reverse_find("abc", 20, 2))

This will return:

(['abc1', 'abc2'], 10)

score 0 · Accepted Answer · answered Oct 30 '19 at 12:58


Thanks to rassar. However, I found the answer here: https://stackoverflow.com/a/23646049/9782619. It is more efficient than Holy Mackerel's and requires fewer dependencies than rassar's.
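For illustration, here is a minimal standard-library sketch of the general idea of reading fixed-size chunks backwards from the offset, rather than one byte at a time. It is not copied from the linked answer. The function name, the `chunk_size` parameter, and the assumptions that `offset` points at the start of a line and that lines end in "\n" are all mine.

```python
import os
import tempfile


def reverse_find(path, needle, offset, count, chunk_size=64 * 1024):
    """Scan backwards from byte `offset`; return (matching_lines, new_offset).

    Assumes `offset` is the start of a line, lines end in b"\n",
    and `needle` is a non-empty bytes object.
    """
    matches = []        # collected newest-first, reversed before returning
    found_at = offset   # start of the earliest matching line so far
    pos = offset        # file position where the current buffer begins
    carry = b""         # line fragment whose beginning is still unread
    with open(path, "rb") as f:
        while pos > 0 and len(matches) < count:
            step = min(chunk_size, pos)
            pos -= step
            f.seek(pos)
            buf = f.read(step) + carry
            pieces = buf.split(b"\n")
            # Unless we reached the start of the file, the first piece may be
            # only the end of a line, so keep it for the next iteration.
            carry = pieces.pop(0) if pos > 0 else b""
            start = pos + len(buf)  # one past the last byte held in buf
            for piece in reversed(pieces):
                start -= len(piece)  # file offset where this piece begins
                if needle in piece:
                    matches.append(piece)
                    found_at = start
                    if len(matches) == count:
                        break
                start -= 1  # step over the "\n" in front of this piece
    matches.reverse()
    return matches, found_at


# Demo on the sample file from the question (assumes "\n" line endings).
sample = b"a\nbabc1\nc\nabc1\nabc2\nd\ne\nf\n"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)
result = reverse_find(tmp.name, b"abc", 20, 2, chunk_size=4)
os.remove(tmp.name)
print(result)
```

On the sample file this returns the lines "abc1" and "abc2" together with offset 10. The tiny `chunk_size=4` in the demo only exercises the carry-over between chunks; for a multi-gigabyte log, a larger chunk such as the default would be the natural choice. If fewer than `count` matches exist, the sketch returns whatever it found and the offset of the earliest match, or the original offset when nothing matched.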

answered Oct 30 '19 at 12:58

YNX
