I was given two text files with data. File A had incorrect data and file B had the correct data. Using the Pandas library I was able to find the mismatches (~17,000 of them!). Now I want to modify file A, replacing each incorrect field with the correct one. For example:

File A (Incorrect)
Name = PARAMETER_1
Field_1 = a
Field_2 = b
Field_3 = c
Field_4 = WRONG1!

Name = PARAMETER_2
Field_1 = a
Field_2 = b
Field_3 = c
Field_4 = WRONG2!
etc.

should be replaced with:

File A (Correct)
Name = PARAMETER_1
Field_1 = a
Field_2 = b
Field_3 = c
Field_4 = CORRECT1!

Name = PARAMETER_2
Field_1 = a
Field_2 = b
Field_3 = c
Field_4 = CORRECT2!
etc.

The DataFrame looked something like this:

   Parameter    Wrong    Correct    Match
0  PARAMETER_1  WRONG1!  CORRECT1!  False
1  PARAMETER_2  WRONG2!  CORRECT2!  False
  etc.
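
(For context, a DataFrame like this can be put together along the following lines. This is only a rough sketch: file B is actually worded differently, so the parsing of file B, the file_B_loc name, and the column handling below are illustrative assumptions.)

import re
import pandas as pd

def extract_field4(path):
    # Parse 'Name = X ... Field_4 = Y' blocks into a {name: field_4_value} dict.
    with open(path, 'r') as f:
        text = f.read()
    return dict(re.findall(r'Name = (\S+)[\w\W]+?Field_4 = (\S+)', text))

wrong = extract_field4(file_A_loc)   # file A, with the bad Field_4 values
right = extract_field4(file_B_loc)   # file B would need its own parser in reality

df = pd.DataFrame(
    [(name, wrong[name], right[name]) for name in wrong if name in right],
    columns=['Parameter', 'Wrong', 'Correct'],
)
df['Match'] = df['Wrong'] == df['Correct']
df = df[~df['Match']]                # keep only the mismatched rows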

I tried doing it with a for loop:

import re

# read file A
with open(file_A_loc, 'r') as f:
    data_text = f.read()

# substitute the Field_4 value for each mismatched parameter
for row in df.itertuples():
    data_text = re.sub(r'(?<=Name = ' + row[1] + r')([\w\W]+?Field_4 = )([\w]+)',
                       r'\g<1>' + row[3], data_text, flags=re.I)

This, as you can imagine, took a very long time (file A is ~40-50 MB). Any suggestions to speed this process up? I scoured the Stack Overflow pages before submitting this question and found references to using a dictionary. I tried that method but got a KeyError:

def foo(rep_dict, text):
    # Create a regular expression from the dictionary keys
    regex = re.compile('|'.join(rep_dict.keys()), flags=re.I)

    # For each match, look up the corresponding value in the dictionary
    return regex.sub(lambda x: rep_dict[x.group(0)], text)

rep_dict = {
    r'(?<=Name = ' + 'PARAMETER_1' + r')([\w\W]+?Field_4 = )([\w]+)': r'\g<1>' + 'CORRECT1!',
    r'(?<=Name = ' + 'PARAMETER_2' + r')([\w\W]+?Field_4 = )([\w]+)': r'\g<1>' + 'CORRECT2!'
}
bar = foo(rep_dict, data_text)
print(bar)

P.S. Please forgive any markdown transgressions on my part.

UPDATE: I tried implementing the methods here and here. It's still taking a very long time, though. At least it resolved the KeyError I was getting earlier.

  • Why do you care about file A at all? It's wrong, and file B is correct, so why not just use file B? – John Gordon Jan 29 '20 at 00:04
  • @JohnGordon File A is to be subsequently used in other scripts to extract relevant data. File B just lists information about, in this example, Field_4. Also, File A and File B are worded/formatted differently. – DannOfThursday Jan 29 '20 at 01:42
  • Try it without the lookbehind. Capture Name and the Fields_1, 2 and 3 and Fields 4 till after the equals sign. Then match the last word using `\w+` and use only group 1 in the replacement followed by the replacement you want to use. `\b(Name = PARAMETER_1.*(?:\r?\nField_[1-3].*)*\r?\nField_4 = )\w+` See https://regex101.com/r/YRrwpw/1 – The fourth bird Jan 29 '20 at 21:29
  • @Thefourthbird Thank you for the excellent suggestion! – DannOfThursday Jan 31 '20 at 15:39
  • @DannOfThursday Was there a speed improvement? – The fourth bird Jan 31 '20 at 15:41
  • @Thefourthbird: There wasn't any improvement when using the bulk change method listed previously. I posted the method that I used to get around the issue. – DannOfThursday Jan 31 '20 at 15:53
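
(For reference, applying the pattern from The fourth bird's comment above once per row would look roughly like this. It is only a sketch: re.escape, the final write-back, and swapping \w+ for \S+ so the '!' in the sample values is matched are additions that are not in the comment.)

import re

# Per-row substitution using the pattern suggested in the comments.
for row in df.itertuples(index=False):
    pattern = (r'\b(Name = ' + re.escape(row.Parameter) +
               r'.*(?:\r?\nField_[1-3].*)*\r?\nField_4 = )\S+')
    # Replacement via a function avoids backslash issues in the correct value.
    data_text = re.sub(pattern, lambda m, c=row.Correct: m.group(1) + c, data_text)

with open(file_A_loc, 'w') as f:
    f.write(data_text)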

1 Answer


I resolved my issue using the following basic algorithm:

  1. Capture everything in file A, using re.findall, and get a list of the form: ['Name = PARAMETER1...Field_4 = WRONG1', 'Name = PARAMETER2...Field_4 = WRONG2', ...]

  2. Use Pandas to get the differences between file A and file B.

  3. Iterate over the rows with df.itertuples and use the DataFrame indexes to apply re.sub to the corresponding element of the list obtained in step 1.

This method, in my use-case, takes about 9-10s to run!
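
A minimal sketch of that approach might look like the following (the block regex, keying on the parameter name rather than positional indexes, and the way the file is re-joined are assumptions; adapt them to the real file layout):

import re

# 1. Capture each 'Name = ... Field_4 = ...' block of file A as one list element.
with open(file_A_loc, 'r') as f:
    data_text = f.read()
blocks = re.findall(r'Name = \S+[\w\W]+?Field_4 = \S+', data_text)

# Map each parameter name to its position in the list (keying on the name here
# instead of relying on the DataFrame indexes lining up with the list).
index_of = {re.match(r'Name = (\S+)', block).group(1): i
            for i, block in enumerate(blocks)}

# 2. df already holds the differences between file A and file B (see the question).

# 3. Apply re.sub only to the mismatched blocks.
for row in df.itertuples(index=False):
    i = index_of[row.Parameter]
    blocks[i] = re.sub(r'(Field_4 = )\S+', r'\g<1>' + row.Correct, blocks[i])

# Reassemble and write the corrected file; joining with blank lines is an assumption.
with open(file_A_loc, 'w') as f:
    f.write('\n\n'.join(blocks))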