I need to create a script that parses a text file containing a list of MD5 hashes and their plaintext values. My script works as it should for small files, but on a list containing millions of lines I get IndexError: list index out of range or MemoryError. I've tried experimenting with a dictionary, but with no luck. For reference, I used the information from this post: How do you read a file into a list in Python?

Sample file structure (the file contains 10 million lines):

00003b63ee5e47514964167709ba60df:ainazulaikha
00004ae02a3cf46250ef834f7b75bb91:78836896hxy7
000066b871abdafac2052532ab9da827:nihao1314521+
0000721897d675d6ac0198ad19d48f21:y138636812709
00008f46c906349f1df99ccdea4104a1:sikaozhanche123
000093856b4e947511870f3e10464129:646434
00009ad044e03d0359e8065a0334a046:LiuYi20011105
0000a4bed6b4a1a6fa96a54ca906e1bd:chiaochiao0520

My script (for testing purposes):

import re

with open('C:/Users/Admin/Downloads/106_17-media_found_hash_plain.txt', 'r') as f:
    string = '00008f46c906349f1df99ccdea4104a1'
    for line in f:
        # Raw string so that \s is passed to the regex engine unchanged
        reg = re.findall(r"^'?([0-9A-Fa-f]{32})'?:'?([^\s]+)'?", line)
        if string in reg[0][0]:
            print('ok')
CDspace
pavlos

3 Answers

First, re.findall() can return an empty list when a line does not match, so check that the list contains something before evaluating

if string in reg[0][0]:

I'd suggest:

if reg and reg[0] and string in reg[0][0]:

Second, the MemoryError could occur if you hit an extremely long line that does not fit in memory. That is unlikely, but it can happen if the file is corrupt or the process that generated it failed to emit newlines for a while. In that case the input has to be fixed; otherwise the code becomes considerably more complex.
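
As a minimal sketch of the guard described above, the snippet below reuses the question's pattern but swaps re.findall for re.match, which returns None on a non-matching line, so there is no list to index into. The helper name find_hash and the sample list are assumptions made for illustration only.

```python
import re

# Same pattern as in the question, compiled once; the raw string keeps \s intact.
LINE_RE = re.compile(r"^'?([0-9A-Fa-f]{32})'?:'?([^\s]+)'?")


def find_hash(lines, target):
    """Return the plaintext paired with `target`, or None if it is absent."""
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # blank or malformed line: skip it instead of indexing
        if m.group(1) == target:
            return m.group(2)
    return None


# Hypothetical input: two lines that do not match, then one that does.
sample = [
    "\n",
    "not a hash line\n",
    "00008f46c906349f1df99ccdea4104a1:sikaozhanche123\n",
]
print(find_hash(sample, "00008f46c906349f1df99ccdea4104a1"))
```

Because `lines` can be any iterable, an open file object can be passed directly, which keeps only one line in memory at a time.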

Jean-François Fabre

Your title, description and actual code point in almost three different directions, but assuming you are simply looking for a string, you can do this:

with open('C:/Users/Admin/Downloads/106_17-media_found_hash_plain.txt', 'r') as f:
    string = '00008f46c906349f1df99ccdea4104a1'
    for line in f:
        if line.startswith(string):
            print('Gotcha! {}'.format(line))

It might take a while, but you will never run out of memory (only one line is held at a time), nor will you get an IndexError.
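
If the plaintext is needed rather than just a confirmation, the same streaming loop can split each line at the first colon. This is only a sketch: the function name lookup and the temporary demo file are assumptions, and str.partition stands in for startswith so that the value after the colon is returned.

```python
import os
import tempfile


def lookup(path, target):
    """Stream the file line by line; return the plaintext for `target` or None."""
    with open(path, "r") as f:
        for line in f:
            # partition splits at the first ':' only, so a plaintext that
            # itself contains ':' is preserved in full.
            digest, sep, plain = line.rstrip("\n").partition(":")
            if sep and digest == target:
                return plain
    return None


# Demo on a small temporary file built from two of the question's sample lines.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("00004ae02a3cf46250ef834f7b75bb91:78836896hxy7\n")
    tmp.write("00008f46c906349f1df99ccdea4104a1:sikaozhanche123\n")
    demo_path = tmp.name

result = lookup(demo_path, "00008f46c906349f1df99ccdea4104a1")
os.remove(demo_path)
print(result)
```

Returning at the first hit also stops reading the rest of the file, which helps when the hash sits near the start.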

Ma0

The simplest way to search for a substring within a string is the in operator: if substring in string. You can do it with re, but that is much less efficient. I've timed several methods to show this:

import re

with open('test.txt') as f:
    data = f.readlines()

string = '00008f46c906349f1df99ccdea4104a1'


def func_1(data, string):

    for line in data:
        if re.match(string, line) is not None:
            pass
    return


def func_2(data, string):

    for line in data:
        if re.search(string, line) is not None:
            pass
    return


def func_3(data, string):

    for line in data:
        if string in line:
            pass
    return


def func_4(data, string):

    for line in data:
        if line.startswith(string):
            pass
    return


def func_5(data, string):

    def thing(line):
        string = '00008f46c906349f1df99ccdea4104a1'
        if string in line:
            pass
        return

    # map() is lazy in Python 3; list() forces it to run over every line
    list(map(thing, data))

    return


def func_6(data, string):

    data = [line.split(':')[0] for line in data]

    if string in data:
        pass

    return

And the results:

--------------------
100  iterations
--------------------

func_1: 0.579837208991
func_2: 0.89487306496
func_3: 0.0426233092805
func_4: 0.0963648696288
func_5: 0.113332976336
func_6: 0.10395732091

--------------------
1000  iterations
--------------------

func_1: 5.49227099705
func_2: 5.43578546216
func_3: 0.457362410806
func_4: 0.971125123276
func_5: 1.00572267516
func_6: 1.00902133508

--------------------
10000  iterations
--------------------

func_1: 61.2676211896
func_2: 61.2018943197
func_3: 4.1501189249
func_4: 9.45583133638
func_5: 9.94970703866
func_6: 10.0233565828

*My test file contained 4472 lines.
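
The answer does not show how the timings were taken. One plausible harness, assuming the standard timeit module and synthetic data in place of the original test file, might look like the sketch below. The function names and the generated lines are hypothetical, and absolute numbers will differ by machine and Python version.

```python
import re
import timeit

# Synthetic stand-in for the test file: 4472 lines shaped like "md5:plaintext".
data = ["{:032x}:password{}\n".format(i, i) for i in range(4472)]
target = "{:032x}".format(4000)


def with_regex(data, string):
    # Count lines whose start matches the target, via re.match
    return sum(1 for line in data if re.match(string, line) is not None)


def with_in(data, string):
    # Count lines containing the target anywhere, via the in operator
    return sum(1 for line in data if string in line)


def with_startswith(data, string):
    # Count lines that begin with the target, via str.startswith
    return sum(1 for line in data if line.startswith(string))


timings = {}
for func in (with_regex, with_in, with_startswith):
    # number=100 mirrors the smallest run reported above.
    timings[func.__name__] = timeit.timeit(lambda: func(data, target), number=100)

for name, seconds in timings.items():
    print("{}: {:.4f} s".format(name, seconds))
```

Each function returns a match count instead of executing pass, so the three approaches can also be checked for agreement before their timings are compared.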

Evan Nowak