I was given two text files with data. File A had incorrect data and file B had the correct data. Using the Pandas library I was able to find the mismatches (~17000!). Now I want to modify file A and replace the incorrect field with the correct one. For e.g.
File A (Incorrect)
Name = PARAMETER_1
Field_1 = a
Field_2 = b
Field_3 = c
Field_4 = WRONG1!
Name = PARAMETER_2
Field_1 = a
Field_2 = b
Field_3 = c
Field_4 = WRONG2!
etc.
should be replaced with:
File A (Correct)
Name = PARAMETER_1
Field_1 = a
Field_2 = b
Field_3 = c
Field_4 = CORRECT1!
Name = PARAMETER_2
Field_1 = a
Field_2 = b
Field_3 = c
Field_4 = CORRECT2!
etc.
The Dataframe looked something like:
Parameter Wrong Correct Match
0 PARAMETER_1 WRONG1! CORRECT1! False
1 PARAMETER_2 WRONG2! CORRECT2! False
etc.
I tried doing it using a for loop:
# read file A
with open(file_A_loc, 'r') as f:
data_text = f.read()
for row in df.itertuples():
new = re.sub(r'(?<=Name = ' + row[1] + r')([\w\W]+?Field_4 = )([\w]+)', r'\g<1>'+row[3], data_text, flags=re.I)
This, you can imagine, took a very long time (File A is ~40-50MB). Any suggestions to speed this process up? I scoured the stackoverflow pages before submitting the question and found references to using a dictionary. I tried using this method but got a KeyError:
def foo(rep_dict, text):
# Create a regular expression from the dictionary keys
regex = re.compile('|'.join(rep_dict.keys()), flags=re.I)
# For each match, look-up corresponding value in dictionary
return regex.sub(lambda x: rep_dict[x.group(0)], text)
rep_dict = {
r'(?<=Name = ' + 'PARAMETER_1' + r')([\w\W]+?Field_4 = )([\w]+)':r'\g<1>'+'CORRECT1!',
r'(?<=Name = ' + 'PARAMETER_2' + r')([\w\W]+?Field_4 = )([\w]+)':r'\g<1>'+'CORRECT2!'
}
bar = foo(rep_dict, data_text)
print(bar)
P.S. Please forgive any markdown transgressions on my part.
UPDATE: I tried implementing the methods here and here. Still taking a very long time though. At least it resolved the KeyError that I was getting earlier.