Please note this is a revised/refined copy of my original inquiry that I hope will provide greater clarity than my first attempt. I am newbie in the world of programming trying to create a script that basically does a series of specific find and replaces on a csv using another csv sheet as a corrections guide. (i.e chiken becomes chicken and bcon becomes Bacon)
so in simple cases:
chikn,1,a
bcon,2,b
egs,3,c
becomes
chickn,1,a
bacon,2,b
eggs,3,c
Thus far,using the code below I have built a dictionary based on an input csv and been able to transform most the corrections on the target (to edit) csv as intended in simple cases. The real challenge, however, is that the actual dataset will often have 1-3 entries per cell (with a common delimter of : beween them) and many of these will have spaces (that is, be phrases instead of single words). building on the prior example with a updated dictionary this would be:
The start of:
chk for sandwich:egs,1,a
bcon,2,b
Bcon:egs,3,c
Should end off as:
Chicken for sandwiches:eggs,1,a
Bacon,b,2
Bacon:eggs,3,c
Instead, my current output drops that latter portion and prints
Chicken for sandwiches,1,a
Bacon,b,2
Bacon,3,c
Code:
#!/usr/bin/env python
"""A script for finding and replacing values in CSV files.
"""
import csv
import sys
def main(args):
"""Execute the transformation script.
Args:
args (list of `str`): The command line arguments.
"""
transform(args[1], args[2], create_reps(args[3]), int(args[4]))
def transform(infile, outfile, reps, column):
"""Write a new CSV file with replaced text.
Args:
infile (str): the sheet of original text with errors
outfile (str): the sheet with the revised text with corrections in place of errors
reps (:obj: `str`): dictionary of error word and corrected word
column (int): the column (0 based) the word revisions will take place in
"""
with open(infile) as csvfile:
with open(outfile, 'w') as w:
spamreader = csv.reader(csvfile)
spamwriter = csv.writer(w)
for row in spamreader:
row[column] = replace_all(row[column], reps)
spamwriter.writerow(row)
def create_reps(infile):
"""Create reps object to use as reference dictionary for transform.
Args:
infile (str): The sheet of original and corrected words used to
generate dicitonary
Returns:
(:obj: `str`): a dictionary listing the error words and their
corrections
"""
reps = {}
with open(infile) as csvfile:
dictreader = csv.reader(csvfile)
for row in dictreader:
reps[row[0]] = row[1]
return reps
# def replace_all(text, reps):
#"""Original Version: Iterate through `reps` and replace key => value in `text`.
# Args:
#text (str): The text to search and replace.
# reps (:obj: `str`): Search for `key` and replace with `value`
# Returns:
# (str): The string with the replacements.
"""
# last = text
# for i, j in reps.items():
# text = text.replace(i, j)
# if last != text:
# return text
def new_replace_all(text, reps):
"""Updated Version: Do a single-pass replacement from a dictionary"""
pattern = re.compile(r'\b(' + '|'.join(reps.keys()) + r')\b')
return pattern.sub(lambda x: reps[x.group()], text)
if __name__ == "__main__":
main(sys.argv)
Thank you all in advance for your time and support. I look forward to your guidance on this!
Best.
----------------Updated 4/5/18 -------------------------------------
With the kind support of HFBrowing I have been able to modify this code to work with the sample data-sets I initially provided. In my real world application however, I find it still crashes when exposed to some of the more complex string matches in my dataset. I welcome any suggestions on how to resolve this issue and have provided some examples and the error below.
Ideally, items within a given cell linked by a "|" would stay together and items in a given cell linked by a ":" would be treated as separate strings and replaced separately.
so if:
"A|first" = "A1" and "B|first" = "B1"
then
"A|first:B|first" should transform to be "A1:B1" .
Using this more complex string data, I have provided examples the intended and current outputs along with the error code received.
Sample Dict .
Error word,Correct word .
Actuarial Science,Accounting:Actuarial Science .
Anthropology,Anthropology:General .
Undeclared,Undecided .
Information Technology and Administrative Management|Administrative Management
Specialization, Information Technology and Administrative
Management:Administrative Management Specialization .
Biology,Biology .
Sample Input .
Major,ID,Last .
Actuarial Science,111,Smith .
Anthropology,222,Bob .
Anthropology:Actuarial Science,333,Johnson .
Information Technology and Administrative Management|Administrative Management Specialization,444,Frank .
Undeclared,555,Timmon .
Current Output error:
Traceback (most recent call last):
File "myscript3.py", line 89, in <module> .
main(sys.argv) .
File "myscript3.py", line 21, in main .
transform(args[1], args[2], create_reps(args[3]), int(args[4])) .
File "myscript3.py", line 41, in transform .
row[column] = new_replace_all(row[column], reps) .
File "myscript3.py", line 68, in new_replace_all .
return pattern.sub(lambda x: reps[x.group()], text)
File "myscript3.py", line 68, in <lambda> .
return pattern.sub(lambda x: reps[x.group()], text) .
KeyError: 'Information Technology and Administrative Management' .
Current Output csv .
"Major,ID,Last .
Accounting:Actuarial Science,111,Sumeri .
Anthropology:General,222,Nelson .
Anthropology:General;Accounting:Actuarial Science,333,Newman . "
-----------------------Update 4/6/18: Resolved --------------------------
Hello All,
thank you all for the support. At the suggestion of a colleague I modified the original "Replace_all" code to read as follows. This appears to now work as intended within my context.
Thank you all again for the time and support!
code
#!/usr/bin/env python
"""A script for finding and replacing values in CSV files.
Example::
./myscript school-data.csv outfile-data.csv replacements.csv 4
"""
import csv
import sys
def main(args):
"""Execute the transformation script.
Args:
args (list of `str`): The command line arguments.
"""
transform(args[1], args[2], create_reps(args[3]), int(args[4]))
def transform(infile, outfile, reps, column):
"""Write a new CSV file with replaced text.
Args:
infile (str): the sheet of original text with errors
outfile (str): the sheet with the revised text with corrections in
place of errors
reps (:obj: `str`): dictionary of error word and corrected word
column (int): the column (0 based) the word revisions will take place
in
"""
with open(infile) as csvfile:
with open(outfile, 'w') as w:
spamreader = csv.reader(csvfile)
spamwriter = csv.writer(w)
for row in spamreader:
row[column] = replace_all(row[column], reps)
spamwriter.writerow(row)
def create_reps(infile):
"""Create reps object to use as reference dictionary for transform.
Args:
infile (str): The sheet of original and corrected words used to
generate dicitonary
Returns:
(:obj: `str`): a dictionary listing the error words and their
corrections
"""
reps = {}
with open(infile) as csvfile:
dictreader = csv.reader(csvfile)
for row in dictreader:
reps[row[0]] = row[1]
return reps
def replace_all(text, reps):
"""Iterate through `reps` and replace key => value in `text`.
Args:
text (str): The text to search and replace.
reps (:obj: `str`): Search for `key` and replace with `value`
Returns:
(str): The string with the replacements.
"""
last = text
for i, j in reps.items():
text = text.replace(i, j)
#if last != text:
# return text
return text
if __name__ == "__main__":
main(sys.argv)