
I'm looking to implement a few lines of Python, using re, to first manipulate a string and then use that string as a regex search. I have strings with `*`s in the middle of them, e.g. ab***cd, where the run of `*`s can be any length. The aim is to run a regex search over a document and extract any lines that match the starting and finishing characters, with any number of characters in between: ab12345cd, abbbcd and ab_fghfghfghcd would all be positive matches, while 1abcd, agcd and bb111cd would be negative matches.

I have come up with `[\s\S]*?` as a replacement for the `*`s. So I want to get from the example string ab***cd to ^ab[\s\S]*?cd, which I will then use for a regex search of the document.
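To make that transformation concrete, here is a minimal sketch of just this step (the backslashes in the replacement are doubled because re.sub rejects unknown single-backslash escapes like \s in the replacement string):

import re

raw = "ab***cd"
# \\s and \\S are doubled so re.sub emits literal \s and \S in the output
pattern = "^" + re.sub(r"\*+", r"[\\s\\S]*?", raw)
print(pattern)                               # ^ab[\s\S]*?cd
print(bool(re.match(pattern, "ab12345cd")))  # True
print(bool(re.match(pattern, "1abcd")))      # False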

I then wanted to open the file in mmap, search through it using the regex, then save the matches to file.

import re
import mmap 

def file_len(fname):
    # count the lines in a file (returns 0 for an empty file)
    i = -1
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

def searchFile(list_txt, raw_str):
    search="^"+raw_str #add regex ^ newline operator
    search_rgx=re.sub(r'\*+',r'[\\s\\S]*?',search) #replace * with regex function

    #search file
    with open(list_txt, 'r+') as f: 
        data = mmap.mmap(f.fileno(), 0)
        results = re.findall(bytes(search_rgx,encoding="utf-8"),data, re.MULTILINE)

    #save results
    f1 = open('results.txt', 'w+b')
    results_bin = b'\n'.join(results)
    f1.write(results_bin)
    f1.close()

    print("Found "+str(file_len("results.txt"))+" results")

searchFile("largelist.txt","ab**cd")

This works fine with a small file. However, when the file gets larger (1 GB of text), I get this error:

Traceback (most recent call last):
  File "c:\Programming\test.py", line 27, in <module>
    searchFile("largelist.txt","ab**cd")
  File "c:\Programming\test.py", line 21, in searchFile
    results_bin = b'\n'.join(results)
MemoryError

Firstly, can anyone help optimize the code slightly? Am I doing something seriously wrong? I used mmap because I knew I wanted to look at large files and wanted to read the file line by line rather than all at once (hence someone suggested mmap).

I've also been told to have a look at the pandas library for data manipulation. Would pandas replace mmap?

Thanks for any help. I'm pretty new to python as you can tell - so appreciate any help.

TomOldy
  • Possible dup of https://stackoverflow.com/questions/25268465/using-mmap-to-apply-regex-to-whole-file – xilpex May 18 '20 at 16:42
  • 1
    No dup. The answer to that question helped me progress to this point. – TomOldy May 18 '20 at 16:50
  • You might indeed be able to use pandas here, but I think it depends on the structure of your data. Pandas has `pandas.Series.str.match`, which returns `True` for _cells_ that fit the regex. So in this case, each line in your doc could be a cell, and you'd get a match on the lines/cells that fit. – Bertil Johannes Ipsen May 18 '20 at 17:40
  • Hi Bertil, the data is a large text file with different strings on newlines: ```ab123 ccv444 sdads444``` like that :) - seems like SO comments won't put stuff on newlines but hopefully you can get what I mean? – TomOldy May 18 '20 at 17:48
  • Right, okay @TomOldy. Check my answer, does it make sense? – Bertil Johannes Ipsen May 18 '20 at 17:54

3 Answers


You are doing line-by-line processing, so you want to avoid accumulating data in memory. Regular file reads and writes should work well here. mmap is backed by virtual memory, but that still has to turn into real memory as you read it. Accumulating the results in findall is also a memory hog: findall materializes every match in a list, and b'\n'.join then builds another buffer of roughly the same total size on top of it. Try this as an alternative:

import re

# buffer reads and writes in 1 MiB blocks; any effect would be modest
MEG = 2**20

def searchFile(filename, raw_str):
    # extract start and end from "ab***cd"
    startswith, endswith = re.match(r"([^\*]+)\*+?([^\*]+)", raw_str).groups()
    with open(filename, buffering=MEG) as in_f, open("results.txt", "w", buffering=MEG) as out_f:
        for line in in_f:
            stripped = line.strip()
            if stripped.startswith(startswith) and stripped.endswith(endswith):
                out_f.write(line)

# write test file

test_txt = """ab12345cd
abbbcd
ab_fghfghfghcd
1abcd
agcd
bb111cd
"""

want = """ab12345cd
abbbcd
ab_fghfghfghcd
"""

open("test.txt", "w").write(test_txt)

searchFile("test.txt", "ab**cd")

with open("results.txt") as f:
    print(f.read() == want)
tdelaney
  • Hi - thanks. Just so I understand this: you're not using regex in the overall search, you're just looking at the start and end characters of each line? If they match the input string `ab**cd`, it writes to file? Sorry, bit confused here, apologies. – TomOldy May 18 '20 at 21:04
  • @TomOldy - I may have misunderstood the problem, but I took it that you have start and end strings separated by asterisks, and you want to find strings in the dataset that have those same start and end strings. So I used regex to get them and then switched to a simpler check of whether the target strings start and end with those discovered strings. – tdelaney May 19 '20 at 20:51

I am not sure what advantage you expect from opening the input file with mmap, but since each string that must be matched is delimited by a newline (as per your comment), I would use the approach below (note that it is Python, but deliberately kept as pseudocode):

with open(input_file_path, "r") as input_file:
    with open(output_file_path, "x") as output_file:
        for line in input_file:
            if is_match(line):
                print(line, file=output_file)

possibly tuning the `end` parameter of the print function to your needs (e.g. `print(line, end="", file=output_file)`, since each line read from the file already carries its own newline).

This way results are written as they are generated, and you avoid holding a large result set in memory before writing it. Furthermore, you don't need to worry about newlines, only about whether each line matches. A possible is_match is sketched below.
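For instance, a minimal is_match built from the start/end check discussed in the comments might look like this (make_is_match is a hypothetical helper, assuming the ab**cd input format):

import re

def make_is_match(raw_str):
    # split e.g. "ab**cd" into its start and end parts (hypothetical helper)
    head, tail = re.match(r"([^*]+)\*+([^*]+)", raw_str).groups()
    def is_match(line):
        stripped = line.strip()
        return stripped.startswith(head) and stripped.endswith(tail)
    return is_match

is_match = make_is_match("ab**cd")
print(is_match("ab12345cd"))  # True
print(is_match("1abcd"))      # False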

beruic
  • Thanks for this - how would you feed the example regular expression into this search? For example `^ab^s[\s\S]*?cd` ?? – TomOldy May 18 '20 at 21:07
  • @TomOldy First, I'd probably just use `.*` instead of `[\s\S]*`, and then I don't quite get your usage of `?`. In any case, if you only ever check for starting and ending characters, I'd parse the input string and use `myString.startswith(head)` and `myString.endswith(tail)`. – beruic May 20 '20 at 22:29
  • @TomOldy I have updated my pseudo code to accommodate that your result always is the input line if it matches. – beruic May 20 '20 at 23:21

How about this? In this situation, what you want is a list of all of your lines represented as strings. The following emulates that, resulting in a list of strings:

import io

longstring = """ab12345cd
abbbcd
ab_fghfghfghcd
1abcd
agcd
bb111cd
"""

list_of_strings = io.StringIO(longstring).read().splitlines()
list_of_strings

Outputs

['ab12345cd', 'abbbcd', 'ab_fghfghfghcd', '1abcd', 'agcd', 'bb111cd']

This is the part that matters

import pandas as pd

s = pd.Series(list_of_strings)
s[s.str.match(r'^ab[\s\S]*?cd')]

Outputs

0         ab12345cd
1            abbbcd
2    ab_fghfghfghcd
dtype: object
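Note that s.str.match returns a boolean Series, which is then used as a mask to keep only the rows whose strings match the pattern.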

Edit 2: Try this (I don't see a reason for you to want it as a function, but I've done it like that since that's what you did in the comments):

import pandas as pd

def newsearch(filename):
    with open(filename, 'r', encoding="utf-8") as f:
        list_of_strings = f.read().splitlines()
    s = pd.Series(list_of_strings)
    s = s[s.str.match(r'^ab[\s\S]*?cd')]
    s.to_csv('output.txt', header=False, index=False)

newsearch('list.txt')

A chunk-based approach

import os
import pandas as pd

def newsearch(filename):
    outpath = 'output.txt'
    if os.path.exists(outpath):
        os.remove(outpath)
    # sep='|' is assumed not to occur in the data, so each line stays a single cell
    for chunk in pd.read_csv(filename, sep='|', header=None, chunksize=10**6):
        chunk = chunk[chunk[0].str.match(r'^ab[\s\S]*?cd')]
        chunk[0].to_csv(outpath, index=False, header=False, mode='a')

newsearch('list.txt')

A dask approach

import dask.dataframe as dd

def newsearch(filename):
    chunk = dd.read_csv(filename, header=None, blocksize=25e6)
    chunk = chunk[chunk[0].str.match(r'^ab[\s\S]*?cd')]
    chunk[0].to_csv('output.txt', index=False, header=False, single_file=True)

newsearch('list.txt')
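With this version, blocksize=25e6 cuts the file into roughly 25 MB partitions that dask can process in parallel, and single_file=True merges the partitioned output back into a single output.txt.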
Bertil Johannes Ipsen