I have a Python script that runs over 1M lines of varying lengths. The script is very slow; it has processed only about 30,000 of them in the last 12 hours. Splitting the file is out of the question since the file is already split. My code looks like this:

import re
import sys

regex1 = re.compile(r"(\{\{.*?\}\})", flags=re.IGNORECASE)
regex2 = re.compile(r"(<ref.*?</ref>)", flags=re.IGNORECASE)
regex3 = re.compile(r"(<ref.*?\/>)", flags=re.IGNORECASE)
regex4 = re.compile(r"(==External links==.*?)", flags=re.IGNORECASE)
regex5 = re.compile(r"(<!--.*?-->)", flags=re.IGNORECASE)
regex6 = re.compile(r"(File:[^ ]*? )", flags=re.IGNORECASE)
regex7 = re.compile(r" [0-9]+ ", flags=re.IGNORECASE)
regex8 = re.compile(r"(\[\[File:.*?\]\])", flags=re.IGNORECASE)
regex9 = re.compile(r"(\[\[.*?\.JPG.*?\]\])", flags=re.IGNORECASE)
regex10 = re.compile(r"(\[\[Image:.*?\]\])", flags=re.IGNORECASE)
regex11 = re.compile(r"^[^_].*(\) )", flags=re.IGNORECASE)

fout = open(sys.argv[2],'a+')

with open(sys.argv[1]) as f:
    for line in f:
        parts=line.split("\t")
        label=parts[0].replace(" ","_").lower()
        line=parts[1].lower()
        try:
            line = regex1.sub("",line )
        except:
            pass
        try:
            line = regex2.sub("",line )
        except:
            pass
        try:
            line = regex3.sub("",line )
        except:
            pass
        try:
            line = regex4.sub("",line )
        except:
            pass
        try:
            line = regex5.sub("",line )
        except:
            pass
        try:
            line = regex6.sub("",line )
        except:
            pass
        try:
            line = regex8.sub("",line )
        except:
            pass
        try:
            line = regex9.sub("",line )
        except:
            pass
        try:
            line = regex10.sub("",line )
        except:
            pass

        try:     
            for match in re.finditer(r"(\[\[.*?\]\])", line):
                replacement_list=match.group(0).replace("[","").replace("]","").split("|")
                replacement_list = [w.replace(" ","_") for w in replacement_list]
                replacement_for_links=' '.join(replacement_list)
                line = line.replace(match.group(0),replacement_for_links)
        except:
            pass
        try:
            line = re.sub(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))', '', line, flags=re.MULTILINE)  
        except:
            pass    
        try:
            line = line.translate(None, '!"#$%&\'*+,./:;<=>?@[\\]^`{|}~')
        except:
            pass        
        try:
            line = line.replace(' (',' ')   
            line=' '.join([word.rstrip(")") if not '(' in word else word for word in line.split(" ")])
            line=re.sub(' isbn [\w-]+ ',' ' ,line)
            line=re.sub(' [p]+ [\w-]+ ',' ' ,line)
            line = re.sub( ' \d+ ', ' ', line)
            line= re.sub("^\d+\s|\s\d+\s|\s\d+$", " ", line)
            line = re.sub( '\s+', ' ', line).strip()
            line=re.sub(' isbn [\w-]+ ',' ' ,line)
        except:
            pass    
        out_string=label+"\t"+line
        fout.write(out_string)
        fout.write("\n")

fout.close()

Is there any change I can make to gain a significant improvement over the current version?
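One candidate change would be to collapse the nine independent substitutions into a single pre-compiled alternation, so each line is scanned once instead of nine times. A sketch (assuming the removal patterns don't have to run in a fixed order; the character classes are a guess at what the markup can contain):

```python
import re

# One pass instead of nine separate .sub() calls (a sketch; assumes the
# removal patterns do not have to run in a fixed order).
combined = re.compile(
    r"\{\{[^{}]*\}\}"                     # {{templates}} (non-nested)
    r"|<ref[^<>]*/>"                      # self-closing <ref ... />
    r"|<ref.*?</ref>"                     # paired <ref>...</ref>
    r"|<!--.*?-->"                        # HTML comments
    r"|\[\[(?:File|Image):[^\[\]]*\]\]",  # [[File:...]] / [[Image:...]]
    flags=re.IGNORECASE,
)

print(combined.sub("", "a {{t}} b <ref name=x/> c <!-- n --> d"))  # a  b  c  d
```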

UPDATE 1: After profiling using the suggestion by @fearless_fool, I realized that regex3, regex9, and the http removal are the least efficient ones.

UPDATE 2: It's interesting to find out that using .* adds a lot more steps to the regex patterns. I tried to replace it with [^X]* where X is a character I know never occurs in the string. It improves performance about 20x for 1000-character lines. For example, regex1 is now regex1 = re.compile(r"(\{\{[^\}]*?\}\})", flags=re.IGNORECASE). If I want to exclude a two-character sequence, I don't know how to do it. For example, I would change (\{\{[^\}]*?\}\}) to (\{\{[^\}\}]*?\}\}), which I now know is wrong, since anything inside [] is treated as a set of individual characters, not a sequence.
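A character class cannot express a two-character sequence, but a tempered dot, a `.` guarded by a negative lookahead, can. A sketch of regex1 rewritten that way:

```python
import re

# A class like [^\}\}] is just [^\}]: classes always match one character.
# To forbid the two-character sequence }}, temper the dot with a lookahead:
regex1 = re.compile(r"(\{\{(?:(?!\}\}).)*?\}\})", flags=re.IGNORECASE)

print(regex1.sub("", "keep {{template|x}} this"))  # keep  this
```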

    why are you using excepts? How are you expecting `line = regex1.sub("",line )` etc.. to error? – Padraic Cunningham Dec 30 '15 at 17:53
  • I strongly advise that you [run a profiler](https://docs.python.org/2/library/profile.html) on your code to take the guesswork out. – danmcardle Dec 30 '15 at 17:55
  • You are using around 20 successive regexes or text iterations on every line; it can only run slowly... What do you expect your code to do? Can't you use higher-level parsers for that (e.g. an XML parser)? – Diane M Dec 30 '15 at 17:58
  • You forgot `regex7` and `regex11`. This is one of the many, many reasons to use lists and loops instead of numbered variables. Also, instead of over a dozen different deletion steps, can you use a regex to search for the stuff you want to *keep*? – user2357112 Dec 30 '15 at 17:58
  • @PadraicCunningham I expect some character encoding-decoding errors. That's why I have try, except. I don't think removing try will improve the speed. – Nick Dec 30 '15 at 18:05
  • @Nick, where are you encoding/decoding? Your code is almost impossible to follow as you have no comments or structure and you catch every exception. – Padraic Cunningham Dec 30 '15 at 18:06
  • It's also worth opening the lovely resource https://regex101.com/#python and testing out your regexen to see how many steps each one takes -- think of it as a regex-specific profiler. – fearless_fool Dec 30 '15 at 18:07
  • @fearless_fool this is an awesome tool. Thanks. Now I know that these regex patterns take the most steps: regex3, regex9, and the http removal. – Nick Dec 30 '15 at 18:48
  • It is good that you posted your code. It would be nice to also see a *small* amount of input and expected output to get a feeling for what you are trying to accomplish. As you can see from the comments we have some trouble following it. :-) See [How to create a Minimal, Complete, and Verifiable example](http://stackoverflow.com/help/mcve). – Roland Smith Dec 30 '15 at 21:29
  • This is a small version that I test my regex on to profile them and decrease the steps: https://regex101.com/r/zH9jJ2/1 – Nick Dec 30 '15 at 22:15
  • Use `sed` instead. It is far easier to describe your simple transformation than with that lump of code. – msw Dec 31 '15 at 23:44

2 Answers

(Elevating what was a comment to an answer): I recommend you use the elegant and useful Regex 101 Tool to profile your individual regexen and see if any of them are taking an inordinate amount of time.

While you're at it, you could post a complete example on the site so others can see what you're using for typical input. (I realize that you've already done this - great!)
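As a local complement to the online tool, each compiled pattern can also be timed with `timeit` on a sample line. A sketch (the pattern subset and the sample text here are illustrative):

```python
import re
import timeit

# Illustrative subset of the patterns from the question; the sample line
# is synthetic, not real input.
patterns = {
    "regex1": re.compile(r"(\{\{.*?\}\})", flags=re.IGNORECASE),
    "regex3": re.compile(r"(<ref.*?\/>)", flags=re.IGNORECASE),
}
sample = "text {{tmpl|a}} more <ref name=x/> tail " * 25

for name, rx in patterns.items():
    t = timeit.timeit(lambda: rx.sub("", sample), number=1000)
    print("%s: %.4f s for 1000 substitutions" % (name, t))
```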

After using the useful regex tool recommended by @fearless_fool, I improved the speed significantly by replacing .* with a more restricted version of it, for example [^\]]*. Applying this change throughout the script improved the performance dramatically.
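For instance, the original `regex8` can be tightened so the engine never scans past a closing bracket. A sketch (the exact character class to use depends on what the input can actually contain):

```python
import re

slow = re.compile(r"(\[\[File:.*?\]\])", flags=re.IGNORECASE)     # lazy dot
fast = re.compile(r"(\[\[File:[^\]]*\]\])", flags=re.IGNORECASE)  # restricted class

# Same result on typical input, but far fewer backtracking steps.
s = "x [[File:pic.jpg|thumb]] y"
assert slow.sub("", s) == fast.sub("", s) == "x  y"
```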
