Index out of range error when looping through file

Question

I need to write a script that parses a text file containing a list of MD5 hashes. The script works as it should for small files, but on a list with millions of lines I get IndexError: list index out of range or MemoryError. I've tried experimenting with a dictionary, but with no luck. For reference, I used the information from this post: How do you read a file into a list in Python?

Sample file structure (the file contains 10 million lines):

00003b63ee5e47514964167709ba60df:ainazulaikha
00004ae02a3cf46250ef834f7b75bb91:78836896hxy7
000066b871abdafac2052532ab9da827:nihao1314521+
0000721897d675d6ac0198ad19d48f21:y138636812709
00008f46c906349f1df99ccdea4104a1:sikaozhanche123
000093856b4e947511870f3e10464129:646434
00009ad044e03d0359e8065a0334a046:LiuYi20011105
0000a4bed6b4a1a6fa96a54ca906e1bd:chiaochiao0520

My script (for testing purposes):

import re

with open('C:/Users/Admin/Downloads/106_17-media_found_hash_plain.txt', 'r') as f:
    string = '00008f46c906349f1df99ccdea4104a1'
    for line in f:
        # Raw string so the regex backslashes are not treated as string escapes
        reg = re.findall(r"^'?([0-9A-Fa-f]{32})'?:'?([^\s]+)'?", line)
        if string in reg[0][0]:
            print('ok')

In the code you posted, you're not keeping the lines of the file in memory, so if the file is as you indicate, there is no reason you'd run out of memory. - kindall, Nov 20 '17 at 16:59
Your file must be corrupt or contain lines that don't match. First test that `reg` isn't empty before accessing its elements. - Jean-François Fabre, Nov 20 '17 at 16:59
Look into Python [generator functions](https://wiki.python.org/moin/Generators) - Jonathan Porter, Nov 20 '17 at 16:59
@Jean-FrançoisFabre Yeah, maybe there's some problem with the line endings where it thinks the whole file is one long line, or something. - kindall, Nov 20 '17 at 17:00
If you are looking for `string`, why bother with a regex instead of just doing `for line in f: if line.startswith(string + ':'): print('ok')`? Besides being less error-prone, it will probably be faster too. - Ma0, Nov 20 '17 at 17:07

score 0 · Answer 1 · answered Nov 20 '17 at 17:01

First, re.findall(...) can return an empty list, so check that the list contains something before evaluating

if string in reg[0][0]:

I'd suggest:

if reg and reg[0] and string in reg[0][0]:
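To make the guard concrete, here is a minimal sketch that skips non-matching lines and records their line numbers so you can inspect them afterwards. The names `HASH_LINE` and `find_hash` are mine, not from the question, and I've assumed an exact comparison of the 32-character hash is what you want rather than a substring test.

```python
import re

# Same pattern as in the question, compiled once (the name is hypothetical).
HASH_LINE = re.compile(r"^'?([0-9A-Fa-f]{32})'?:'?(\S+)'?")


def find_hash(lines, wanted):
    """Return (plaintexts found for `wanted`, numbers of lines that did not match)."""
    matches, bad = [], []
    for lineno, line in enumerate(lines, start=1):
        m = HASH_LINE.match(line)
        if m is None:
            # Blank or malformed line: remember it instead of indexing into nothing.
            bad.append(lineno)
            continue
        if m.group(1) == wanted:
            matches.append(m.group(2))
    return matches, bad
```

Passing the open file object as `lines` keeps the line-by-line iteration, so nothing more than the matches and the bad line numbers is held in memory.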

Second, the MemoryError could occur if you hit a very long line that exceeds the memory available to Python. That is unlikely, but it can happen if the file is corrupt or the process that generated it "forgot" to emit newlines for a while. In that case the input needs to be fixed; otherwise the code would have to become considerably more complex.
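If you want to check whether the file really contains such a line without loading it whole, one possible approach is to call `readline()` with a size limit, which returns at most that many characters per call. This is only a diagnostic sketch; the function name `long_lines` and the 4096-character default are assumptions of mine.

```python
import io


def long_lines(f, limit=4096):
    """Yield (line_number, length) for every line longer than `limit` characters."""
    lineno, length = 1, 0
    while True:
        chunk = f.readline(limit)  # never returns more than `limit` characters
        if not chunk:
            break
        length += len(chunk)
        if chunk.endswith("\n"):
            # End of the current line reached; report it if it was too long.
            if length > limit:
                yield lineno, length
            lineno += 1
            length = 0
    if length > limit:
        # Final line without a trailing newline.
        yield lineno, length


# Demo on an in-memory file: line 2 has 50 characters plus its newline.
demo = io.StringIO("short\n" + "x" * 50 + "\nok\n")
print(list(long_lines(demo, limit=10)))  # [(2, 51)]
```

The reported length includes the trailing newline. On the real file you would pass the object returned by `open(...)` instead of the `StringIO` demo.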

score 0 · Answer 2 · answered Nov 20 '17 at 17:10

Your title, description, and actual code point in almost three different directions, but assuming you are simply looking for `string`, you can do this:

with open('C:/Users/Admin/Downloads/106_17-media_found_hash_plain.txt', 'r') as f:
    string = '00008f46c906349f1df99ccdea4104a1'
    for line in f:
        if line.startswith(string):
            print('Gotcha! {}'.format(line))

It might take a while, but you will not run out of memory (nothing is being stored), nor will you get an IndexError.
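If you also need the plain text that belongs to the hash, a small variation could return it instead of printing the line. This is a sketch under the assumption that every line has the form hash:plaintext as in your sample; the function name `lookup` is hypothetical. Including the colon in the prefix follows the suggestion from the comments.

```python
def lookup(lines, wanted):
    """Return the plain text stored for `wanted`, or None if it is not found."""
    prefix = wanted + ":"
    for line in lines:
        if line.startswith(prefix):
            # Everything after "hash:" is the plain text; drop the line ending.
            return line[len(prefix):].rstrip("\r\n")
    return None
```

You would call it as `lookup(f, string)` inside the `with open(...)` block, so the file is still read one line at a time.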

score 0 · Answer 3 · answered Nov 20 '17 at 17:58

The simplest way to search for a substring within a string is the `in` operator: if substring in string. You can do it using re, but it is much less efficient. I've timed a few methods to show this:

import re

with open('test.txt') as f:
    data = f.readlines()

string = '00008f46c906349f1df99ccdea4104a1'


def func_1(data, string):

    for line in data:
        if re.match(string, line) is not None:
            pass
    return


def func_2(data, string):

    for line in data:
        if re.search(string, line) is not None:
            pass
    return


def func_3(data, string):

    for line in data:
        if string in line:
            pass
    return


def func_4(data, string):

    for line in data:
        if line.startswith(string):
            pass
    return


def func_5(data, string):

    def thing(line):
        string = '00008f46c906349f1df99ccdea4104a1'
        if string in line:
            pass
        return

    # map() is lazy in Python 3, so wrap it in list() to actually run thing()
    list(map(thing, data))

    return


def func_6(data, string):

    data = [line.split(':')[0] for line in data]

    if string in data:
        pass

    return

And the results:

--------------------
100  iterations
--------------------

func_1: 0.579837208991
func_2: 0.89487306496
func_3: 0.0426233092805
func_4: 0.0963648696288
func_5: 0.113332976336
func_6: 0.10395732091

--------------------
1000  iterations
--------------------

func_1: 5.49227099705
func_2: 5.43578546216
func_3: 0.457362410806
func_4: 0.971125123276
func_5: 1.00572267516
func_6: 1.00902133508

--------------------
10000  iterations
--------------------

func_1: 61.2676211896
func_2: 61.2018943197
func_3: 4.1501189249
func_4: 9.45583133638
func_5: 9.94970703866
func_6: 10.0233565828

*My test file contained 4472 lines.
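The timing harness itself is not shown above. One way to reproduce this kind of measurement is the standard `timeit` module; the sketch below is an assumption about how it could be done, not the exact code that produced the numbers. The names `bench`, `with_in`, and `with_startswith` are hypothetical stand-ins for the functions listed earlier.

```python
import timeit


def with_in(data, string):
    # Stand-in for func_3: substring test on every line.
    for line in data:
        if string in line:
            pass


def with_startswith(data, string):
    # Stand-in for func_4: prefix test on every line.
    for line in data:
        if line.startswith(string):
            pass


def bench(funcs, data, string, number=100):
    """Return {function name: total seconds for `number` calls}."""
    results = {}
    for func in funcs:
        results[func.__name__] = timeit.timeit(
            lambda: func(data, string), number=number
        )
    return results
```

For example, `bench([with_in, with_startswith], data, string, number=1000)` returns one total time per function. Absolute values depend on the machine and Python version, so only the relative ordering is worth comparing.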
