Search and locate unique element repeated in a section from a large file by python

Question

I have many large csv files in which I'm only interested in a single phrase, e.g. 'satoshi'. It appears repeatedly in only one part of every file, e.g. lines 25-30 or 36-48 (different per file). I need the code to return the line numbers of the first and last appearance of 'satoshi'. My current code is:

with open('file.csv') as f:
    content = f.readlines()
row_number = []
for line in content:
    if 'satoshi' in line:
        row_number.append(content.index(line))
first = row_number[0]
last = row_number[-1]

However, this code is inefficient because it searches the whole file line by line, even after the only section containing the keyword 'satoshi' has passed. I can't figure out how to stop searching once the keyword stops appearing. Thanks for any advice.

Rivers · Answer 1 · 2021-02-26T13:44:02.490

The code must search in every line, there is no other way for a computer program to know if a string contains a substring.

So we have to find a way to iterate each line faster.

Here is how we could make it more efficient:

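The call to index slows down your code: for every match it scans the list again from the start, and it returns the position of the first identical line. So, the first optimization would be to use enumerate: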

file = "file.csv"
to_find = "satoshi"

with open(file) as f:
    lines = f.readlines()

row_number = []
for index,line in enumerate(lines):
    if to_find in line:
        row_number.append(index)

first_line_index = row_number[0]
last_line_index = row_number[-1]
first_line = lines[first_line_index]
last_line = lines[last_line_index]
print(first_line)
print(last_line)

Another optimization is to use a list comprehension, which is usually faster than appending in a loop:

file = "file.csv"
to_find = "satoshi"

with open(file) as f:
    lines = f.readlines()

lines_indexes = [index for index,line in enumerate(lines) if to_find in line]

first_line_index = lines_indexes[0]
last_line_index = lines_indexes[-1]
first_line = lines[first_line_index]
last_line = lines[last_line_index]
print(first_line)
print(last_line)
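Both versions above load the whole file into a list first. If only the two line indexes are needed, the file can be scanned lazily instead. The following is a minimal sketch under that assumption; the function name `first_and_last_match` is made up for illustration, and `io.StringIO` merely stands in for a real file object.

```python
import io

def first_and_last_match(lines, needle):
    """Return (first, last) 0-based indexes of lines containing needle, or None."""
    first = last = None
    for index, line in enumerate(lines):
        if needle in line:
            if first is None:
                first = index  # remember only the first hit
            last = index       # overwritten on every later hit
    return None if first is None else (first, last)

# In-memory stand-in for a file; with a real file, pass the object from open().
sample = io.StringIO("abcd\nabcd satoshi 1\nabcd\nabcd satoshi 2\nabcd\n")
print(first_and_last_match(sample, "satoshi"))  # (1, 3)
```

This still reads every line, but it keeps only two integers in memory and returns None when nothing matches, instead of raising an IndexError.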

Following comments, this is an edit:

I wrote:

Note that the answer of @Janez Kuhar does iterate over all lines, because the else if statement, as written, has no effect on the iteration. Also, Python has no else if keyword, only elif. There is a design error as well, because the elif statement has no relation to the if statement here. Third note: with this code you will not get the first and the last line, only the first one. Lastly, using index is a problem: if one of your files contains identical lines, the call to index will always return the position of the first of them.

Regarding the problem of not getting the real last line containing the substring:

That's the case if the files are structured like this:

abcd
abcd
abcd
abcd
abcd satoshi abcd 1
abcd satoshi abcd 2
abcd satoshi abcd 3
abcd
abcd satoshi abcd 4
abcd
abcd
abcd

Here, you will get the line ending in 3 as the last line containing the substring, but it should be the line ending in 4.

But, if your files have this structure:

abcd
abcd
abcd
abcd
abcd satoshi abcd 1
abcd satoshi abcd 2
abcd satoshi abcd 3
abcd satoshi abcd 4
abcd
abcd
abcd

That is, if all lines containing the substring always directly follow one another, @Janez Kuhar's code will indeed return the real last line. And of course, in this case there is no need to iterate over all the lines.
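The difference between the two layouts can be checked with a small sketch. The helper `match_range` and its `stop_after_block` parameter are hypothetical names introduced here; the two lists reproduce the sample files shown above.

```python
def match_range(lines, needle, stop_after_block=False):
    """Return (first, last) 0-based indexes of lines containing needle."""
    first = last = None
    for index, line in enumerate(lines):
        if needle in line:
            if first is None:
                first = index
            last = index
        elif stop_after_block and first is not None:
            break  # first non-matching line after a match: stop early
    return first, last

# Layout with a gap inside the matching section
scattered = (["abcd"] * 4
             + ["abcd satoshi abcd 1", "abcd satoshi abcd 2", "abcd satoshi abcd 3"]
             + ["abcd", "abcd satoshi abcd 4"]
             + ["abcd"] * 3)
# Layout where all matching lines follow one another
contiguous = (["abcd"] * 4
              + [f"abcd satoshi abcd {n}" for n in range(1, 5)]
              + ["abcd"] * 3)

print(match_range(scattered, "satoshi"))                          # (4, 8)
print(match_range(scattered, "satoshi", stop_after_block=True))   # (4, 6)
print(match_range(contiguous, "satoshi"))                         # (4, 7)
print(match_range(contiguous, "satoshi", stop_after_block=True))  # (4, 7)
```

Stopping early is only safe when the matching lines are guaranteed to be adjacent; with a gap, it reports the end of the first block rather than the real last match.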

It was unclear to me that the lines will always be one after the other as @Janez Kuhar pointed out. I thought it could have some other lines (not containing the substring) in between, even if they appear in a specific part of the file.

And by the way, I'm glad we had this constructive and instructive debate !

*"The code must search in every line, there is no other way..."* this is not completely true in this case. But your answer is still relevant because of the speedup. – Janez Kuhar, Feb 26 '21 at 12:16
Thanks for your comment. Could you explain why you think that *"this is not completely true in this case"*? – Rivers, Feb 26 '21 at 12:20
Well, the OP said that the `'satoshi'` lines are contiguous. Hence, you can stop checking the lines immediately after the last `'satoshi'` match. – Janez Kuhar, Feb 26 '21 at 12:53
Hi. Thank you for the solution. I need time to digest it, but I need to confirm that @JanezKuhar's solution does work (for getting the first and last line number where the phrase appears). And I don't understand why you say his code would iterate all lines. – Alex, Feb 26 '21 at 13:11
I combined your codes into one as the solution. I mean using enumerate(lines) instead of index(line), and the flag/break! It works well. – Alex, Feb 26 '21 at 13:27

Janez Kuhar · Accepted Answer · 2021-02-26T13:18:26.733

You could add a special boolean flag to your program that would track this for you:

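with open('file.csv') as f:
    content = f.readlines()

row_number = []
flag = False  # becomes True once the first 'satoshi' line has been seen
for line in content:
    if 'satoshi' in line:
        row_number.append(content.index(line))
        flag = True
    elif flag:
        # first non-matching line after the block: stop searching
        break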

Note: As @Rivers has hinted, content.index() returns the index of the first matching line, so you may get an incorrect result if there are duplicate 'satoshi' lines. Furthermore, using index() on a list is an O(n) operation, which makes this solution inefficient.
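One way to address both points is to take the line number from enumerate and read the file lazily, stopping at the first non-matching line after the block. This is only a sketch of that combination, not code from either answer: `contiguous_line_range` is a made-up name, line numbers are 1-based by choice, and the temporary file exists purely for the demo.

```python
import os
import tempfile

def contiguous_line_range(path, needle="satoshi"):
    """Return 1-based (first, last) line numbers of one contiguous block of matches."""
    first = last = None
    with open(path) as f:
        for number, line in enumerate(f, start=1):
            if needle in line:
                if first is None:
                    first = number
                last = number
            elif first is not None:
                break  # block has ended; the rest of the file is never read
    return first, last

# Demo on a small temporary file (assumed layout: one contiguous block)
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("a,b\nc,d\nsatoshi,1\nsatoshi,2\ne,f\n")
    demo_path = tmp.name

print(contiguous_line_range(demo_path))  # (3, 4)
os.remove(demo_path)
```

If no line matches, the function returns (None, None) after reading the whole file.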

edited Feb 26 '21 at 13:18

answered Feb 26 '21 at 11:27


Hi, what does ';' do in Python? Someone suggested something similar to your solution, just without using ';'. But that won't work, because the for loop would stop in the first iteration: 'satoshi' doesn't appear in the first line, so the loop breaks immediately. I tried yours with ';' and it doesn't yield anything either. – Alex Feb 26 '21 at 11:35
@Alex You have unfairly tasked me with another question! ^^ Does this question clear the use of semicolons (`;` ) in Python for you: https://stackoverflow.com/questions/8236380/why-is-semicolon-allowed-in-this-python-snippet ? I will edit my answer to be more consistent. – Janez Kuhar Feb 26 '21 at 11:41
Sorry, my bad. Your code does work. But I may need to research how ';' works in Python. – Alex Feb 26 '21 at 11:45
@Alex The loop doesn't break if the first line does not contain `'satoshi'`. Try *running* my code and see for yourself. – Janez Kuhar Feb 26 '21 at 11:45
This code is incorrect. See my answer for details, but in a nutshell you will only get the first line, not the last one containing the substring, and `else if` does not exist in Python. – Rivers Feb 26 '21 at 12:18
@Rivers You are correct, my Python is rusty! Fixed. – Janez Kuhar Feb 26 '21 at 12:57
