
Python newbie here. I am trying to search through a large document and extract text. The data I need is the value inside the parentheses: xx(bb) xx(bb). xx can be any combination of digits and bb can be any characters, including digits, so you can have a line that contains 1(YJ) 2(*). The end goal is to compare the characters inside the parentheses with the values in a set: {'hO', 'Ih', 'Dn', '8', 'MF', 'dC', '6', 'RE', 'WM', 'Dh', '5'}. So I would be checking whether YJ and * are in the set.

For this purpose I have written a few methods to parse the giant file. The issue is that this takes a long time. For a file of around 1.5 GB, it takes 49 seconds. For files bigger than 5 GB, it takes 5 minutes to search:

Method 1: Works and prints the matching lines but, like all the methods here, it is slow:

import re

with open(filename, 'rb', buffering=102400) as f:
    time_data_count = 0
    for line in f:
        # if b'(X,Y)' in line:
        #     print(line)
        # Raw bytes literal so \d, \( and \s reach the regex engine unchanged
        if re.search(rb'\d+\(.+\)\s+\d+\(.+\)', line):
            print(line)

Method 2: Another issue with this method is that it always prints "none". Why?

with open(filename, 'rb') as f:
    #text = []
    while True:
        memcap = f.read(102400)
        if not memcap:
            break
        text = re.search(rb'\d+\(.+\)\s+\d+\(.+\)', memcap)
        if text is None:
            print("none")

Method 3: This one only prints a list of one element unless the file is below 1 GB. Why is this?

with open(filename, 'rb') as f:
    time_data_count = 0
    text = []
    while True:
        memcap = f.read(102400)
        if not memcap:
            break
        text = re.findall(rb'\d+\(.+\)\s+\d+\(.+\)', memcap)
    print(text)

So those are the three methods I have written. Only Method 1 works as it should, but they all share the same issue of being slow. Is Python's regex just slow in general? Is there a different way to get the values I need without using regex? I thought reading in binary mode with a file buffer would help, but this is as fast as it gets. Please help.

edo101
  • Perhaps you could use multiprocessing on partitions of the file. – Chris Jun 05 '20 at 17:31
  • @Chris I am new to coding. I've heard that parallelizing work on a single file is advanced. If it were multiple files, it seems simple enough from the quick Google search I've done. I have no idea how I would parallelize one document – edo101 Jun 05 '20 at 17:35
  • It is unclear what you are trying to do. Perhaps you should give us more examples of your inputs and tell us more about your desired output. I see nothing being compared in your code. – Ωmega Jun 05 '20 at 17:35
  • @Ωmega I am gonna compare the string extracted by the regex to a set that contains a bunch of strings. Sample set: {'hO', 'Ih', 'Dn', '8', 'MF', 'dC', '6', 'RE', 'WM', 'Dh', '5'} – edo101 Jun 05 '20 at 17:59
  • @Ωmega I have edited the OP to include this set. I need to compare the values inside the parentheses with the values in the set – edo101 Jun 05 '20 at 18:00
  • Inside which parentheses? Your `\d+\(.+\)\s+\d+\(.+\)` contains two such groups. And why don't you compare it with the regex right away? – Ωmega Jun 05 '20 at 18:04
  • @Ωmega what do you mean by compare it with the regex right away? Am I not doing that in my methods? Yes, right now I am looking for the values inside the ( ). The data is generally of the form: xx(bb) xx(bb). I need the bb values. – edo101 Jun 05 '20 at 18:07
  • You are matching `.+` inside the parentheses, but you should use `(?:hO|Ih|Dn|8|MF|dC|6|RE|WM|Dh|5)` – Ωmega Jun 05 '20 at 18:10
  • @Ωmega that set is just a sample. I actually have 127 elements in that set. I didn't want to post them here for brevity, and the data is also a bit sensitive. Are you suggesting I list out all 127 elements in code? – edo101 Jun 05 '20 at 18:14
  • You don't have to list it here, just adopt it into your regex pattern – Ωmega Jun 05 '20 at 18:22
  • Can I achieve what I am looking for without using Python's re module? Can I somehow just search using strings interpolated with regex, like an `if \d+\(.+\) in string` type of thing, without using re? The little googling I've done seems to show that Python is slow with regex @Ωmega – edo101 Jun 05 '20 at 18:35
  • You can use the `grep` command-line tool for regex searches within files – Ωmega Jun 05 '20 at 20:16
  • @Ωmega isn't grep a Unix-based thing? I am on Windows – edo101 Jun 05 '20 at 20:30
  • @edo101 - see https://stackoverflow.com/questions/87350/what-are-good-grep-tools-for-windows – Ωmega Jun 06 '20 at 21:16

1 Answer


If you are doing lots of searches with the same regex pattern, you should use re.compile:

import re

# Compile once, outside the loop, and reuse the compiled pattern for every line
search_pattern = re.compile(rb'\d+\(.+\)\s+\d+\(.+\)')

with open(filename, 'rb', buffering=102400) as f:
    time_data_count = 0
    for line in f:
        if search_pattern.search(line):
            print(line)
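
Since the end goal in the question is to compare the values inside the parentheses against a set, a capture group can pull those values out directly. What follows is only a minimal sketch under stated assumptions: `WANTED`, `VALUE_PATTERN`, `wanted_values` and `scan` are hypothetical names, the set holds just the sample values from the question (as bytes, because the file is opened in binary mode), and the values themselves are assumed never to contain `)`. Unlike the original pattern, it matches each xx(bb) item on its own instead of requiring two in a row.

```python
import re

# Sample values from the question, stored as bytes to match binary reads.
# Assumption: the real set (127 entries) would be built the same way.
WANTED = {b'hO', b'Ih', b'Dn', b'8', b'MF', b'dC', b'6', b'RE', b'WM', b'Dh', b'5'}

# Digits, then "(", then a captured run of non-")" bytes, then ")".
# [^)]+ stops at the first closing parenthesis, unlike the greedy .+
VALUE_PATTERN = re.compile(rb'\d+\(([^)]+)\)')


def wanted_values(line):
    """Return the parenthesized values on one line that appear in WANTED."""
    return [value for value in VALUE_PATTERN.findall(line) if value in WANTED]


def scan(path):
    """Yield (line, hits) for every line holding at least one wanted value."""
    with open(path, 'rb') as f:
        for line in f:
            hits = wanted_values(line)
            if hits:
                yield line, hits
```

With this sketch, `wanted_values(b'1(YJ) 2(*)')` gives an empty list, because neither `YJ` nor `*` is in the sample set. Set membership is a hash lookup, so the size of the set should matter little; whether this is faster overall on multi-gigabyte files would need to be measured.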
jdaz
  • What does compile do for you? I will try it now and see how fast it can go – edo101 Jun 05 '20 at 17:44
  • Unfortunately it runs at the same speed. Compile didn't help at all – edo101 Jun 05 '20 at 17:49
  • This should help: https://stackoverflow.com/questions/42742810/speed-up-millions-of-regex-replacements-in-python-3/42747503#42747503 – jdaz Jun 05 '20 at 18:07
  • Can I achieve what I am looking for without using Python's re module? Can I somehow just search using strings interpolated with regex, like an `if \d+\(.+\) in string` type of thing, without using re @jdaz? The little googling I've done seems to show that Python is slow with regex – edo101 Jun 05 '20 at 18:13
  • So using compile from the regex 2020.5.14 module ("regex.compile") made a world of difference compared with regex.search or any of the variations from the re module. Time went from 4.5 mins to 34 seconds @jdaz. I looked at your link and I am a little lost lol. – edo101 Jun 05 '20 at 20:32
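
The alternation suggested in the comments under the question does not have to be typed out by hand; it can be generated from the set. The sketch below is an illustration only: `build_pattern` is a hypothetical helper, the values are the sample ones from the question, and each value is passed through `re.escape` on the assumption that entries such as `*` should be matched literally.

```python
import re

# Sample values from the question, as bytes because the file is read in
# binary mode. Assumption: the full set would be loaded the same way.
WANTED = [b'hO', b'Ih', b'Dn', b'8', b'MF', b'dC', b'6', b'RE', b'WM', b'Dh', b'5']


def build_pattern(values):
    """Compile digits + "(" + one of the given values + ")" into a pattern."""
    # re.escape keeps regex metacharacters such as * or . literal
    alternation = b'|'.join(re.escape(value) for value in values)
    return re.compile(rb'\d+\((' + alternation + rb')\)')


pattern = build_pattern(WANTED)

# findall returns only the captured value for each matching item
matches = pattern.findall(b'1(YJ) 2(hO) 3(5) 4(55)')
```

Here only items whose parenthesized value is exactly one of the listed entries match, so no separate membership check is needed afterwards. No claim is made about speed; that would have to be timed against the other approaches on the real data.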