Using python dictionary to find/replace strings is csv with multiple strings to replace per cell

Question

Please note this is a revised/refined copy of my original inquiry that I hope will provide greater clarity than my first attempt. I am newbie in the world of programming trying to create a script that basically does a series of specific find and replaces on a csv using another csv sheet as a corrections guide. (i.e chiken becomes chicken and bcon becomes Bacon)

so in simple cases:
chikn,1,a
bcon,2,b
egs,3,c

becomes
chickn,1,a
bacon,2,b
eggs,3,c

Thus far,using the code below I have built a dictionary based on an input csv and been able to transform most the corrections on the target (to edit) csv as intended in simple cases. The real challenge, however, is that the actual dataset will often have 1-3 entries per cell (with a common delimter of : beween them) and many of these will have spaces (that is, be phrases instead of single words). building on the prior example with a updated dictionary this would be:

The start of:
chk for sandwich:egs,1,a
bcon,2,b
Bcon:egs,3,c

Should end off as:
Chicken for sandwiches:eggs,1,a
Bacon,b,2
Bacon:eggs,3,c

Instead, my current output drops that latter portion and prints
Chicken for sandwiches,1,a
Bacon,b,2
Bacon,3,c

Code:

#!/usr/bin/env python
"""A script for finding and replacing values in CSV files.

"""

import csv
import sys


def main(args):
    """Execute the transformation script.

    Args:
        args (list of `str`): The command line arguments.

    """
    transform(args[1], args[2], create_reps(args[3]), int(args[4]))


def transform(infile, outfile, reps, column):
    """Write a new CSV file with replaced text.

    Args:
        infile (str): the sheet of original text with errors
        outfile (str): the sheet with the revised text with corrections in place of errors
        reps (:obj: `str`): dictionary of error word and corrected word
        column (int): the column (0 based) the word revisions will take place in

    """
    with open(infile) as csvfile:
        with open(outfile, 'w') as w:
            spamreader = csv.reader(csvfile)
            spamwriter = csv.writer(w)
            for row in spamreader:
                row[column] = replace_all(row[column], reps)
                spamwriter.writerow(row)


def create_reps(infile):
    """Create reps object to use as reference dictionary for transform.

    Args:
        infile (str): The sheet of original and corrected words used to
        generate dicitonary

    Returns:
        (:obj: `str`): a dictionary listing the error words and their
        corrections

    """
    reps = {}
    with open(infile) as csvfile:
        dictreader = csv.reader(csvfile)
        for row in dictreader:
            reps[row[0]] = row[1]

    return reps


# def replace_all(text, reps):
    #"""Original Version: Iterate through `reps` and replace key => value in `text`.

    # Args:
      #text (str): The text to search and replace.
   # reps (:obj: `str`): Search for `key` and replace with `value`

   # Returns:
     # (str): The string with the replacements.

    """
    # last = text
    # for i, j in reps.items():
     #   text = text.replace(i, j)
      #  if last != text:
       #     return text

def new_replace_all(text, reps):
    """Updated Version: Do a single-pass replacement from a dictionary"""
    pattern = re.compile(r'\b(' + '|'.join(reps.keys()) + r')\b')
    return pattern.sub(lambda x: reps[x.group()], text)

if __name__ == "__main__":
    main(sys.argv)

Thank you all in advance for your time and support. I look forward to your guidance on this!

Best.

----------------Updated 4/5/18 -------------------------------------

With the kind support of HFBrowing I have been able to modify this code to work with the sample data-sets I initially provided. In my real world application however, I find it still crashes when exposed to some of the more complex string matches in my dataset. I welcome any suggestions on how to resolve this issue and have provided some examples and the error below.

Ideally, items within a given cell linked by a "|" would stay together and items in a given cell linked by a ":" would be treated as separate strings and replaced separately.

so if:
"A|first" = "A1" and "B|first" = "B1"
then
"A|first:B|first" should transform to be "A1:B1" .

Using this more complex string data, I have provided examples the intended and current outputs along with the error code received.

Sample Dict .
Error word,Correct word .
Actuarial Science,Accounting:Actuarial Science .
Anthropology,Anthropology:General .
Undeclared,Undecided .
Information Technology and Administrative Management|Administrative Management
Specialization, Information Technology and Administrative
Management:Administrative Management Specialization .
Biology,Biology .

Sample Input .
Major,ID,Last .
Actuarial Science,111,Smith .
Anthropology,222,Bob .
Anthropology:Actuarial Science,333,Johnson .
Information Technology and Administrative Management|Administrative Management Specialization,444,Frank .
Undeclared,555,Timmon .

Current Output error:

    Traceback (most recent call last):  
  File "myscript3.py", line 89, in <module> . 
    main(sys.argv) . 
  File "myscript3.py", line 21, in main . 
    transform(args[1], args[2], create_reps(args[3]), int(args[4])) . 
  File "myscript3.py", line 41, in transform . 
    row[column] = new_replace_all(row[column], reps) . 
  File "myscript3.py", line 68, in new_replace_all . 
    return pattern.sub(lambda x: reps[x.group()], text)  
  File "myscript3.py", line 68, in <lambda> .   
    return pattern.sub(lambda x: reps[x.group()], text) .   
KeyError: 'Information Technology and Administrative Management' .

Current Output csv .
"Major,ID,Last .
Accounting:Actuarial Science,111,Sumeri .
Anthropology:General,222,Nelson .
Anthropology:General;Accounting:Actuarial Science,333,Newman . "

-----------------------Update 4/6/18: Resolved --------------------------

Hello All,

thank you all for the support. At the suggestion of a colleague I modified the original "Replace_all" code to read as follows. This appears to now work as intended within my context.

Thank you all again for the time and support!

code

   #!/usr/bin/env python
"""A script for finding and replacing values in CSV files.

Example::
    ./myscript school-data.csv outfile-data.csv replacements.csv 4

"""

import csv
import sys


def main(args):
    """Execute the transformation script.

    Args:
        args (list of `str`): The command line arguments.

    """
    transform(args[1], args[2], create_reps(args[3]), int(args[4]))


def transform(infile, outfile, reps, column):
    """Write a new CSV file with replaced text.

    Args:
        infile (str): the sheet of original text with errors
        outfile (str): the sheet with the revised text with corrections in
            place of errors
        reps (:obj: `str`): dictionary of error word and corrected word
        column (int): the column (0 based) the word revisions will take place
            in

    """
    with open(infile) as csvfile:
        with open(outfile, 'w') as w:
            spamreader = csv.reader(csvfile)
            spamwriter = csv.writer(w)
            for row in spamreader:
                row[column] = replace_all(row[column], reps)
                spamwriter.writerow(row)


def create_reps(infile):
    """Create reps object to use as reference dictionary for transform.

    Args:
        infile (str): The sheet of original and corrected words used to
        generate dicitonary

    Returns:
        (:obj: `str`): a dictionary listing the error words and their
        corrections

    """
    reps = {}
    with open(infile) as csvfile:
        dictreader = csv.reader(csvfile)
        for row in dictreader:
            reps[row[0]] = row[1]

    return reps


def replace_all(text, reps):
    """Iterate through `reps` and replace key => value in `text`.

    Args:
      text (str): The text to search and replace.
    reps (:obj: `str`): Search for `key` and replace with `value`

    Returns:
      (str): The string with the replacements.

    """
    last = text
    for i, j in reps.items():
        text = text.replace(i, j)
        #if last != text:
        #    return text
    return text

if __name__ == "__main__":
    main(sys.argv)

Your google drive link doesn't work and it's a sketchy way to share information anyway. Can you provide an example here that replicates the problems you're having? For example, you could write out as text what should be in col1 and col2 of each CSV (but *minimally* to reproduce problem!) — HFBrowning, Apr 03 '18 at 18:56
Thanks for the heads up. I have fixed the link and included text examples as well. — lostnthought247, Apr 03 '18 at 23:17

score 0 · Answer 1 · answered Apr 05 '18 at 18:33

0

I actually couldn't get your code example to work correctly at replacing things at all, so I'm sure there are some differences in the way I have structured my CSVs compared to what you are doing. Nevertheless I think problem is in your replace_all() function because replacing text sequentially can be tricky. Here's is the solution to that linked question, adapted as a function. Does this fix the problem for you?

def new_replace_all(text, reps):
    """Do a single-pass replacement from a dictionary"""
    pattern = re.compile(r'\b(' + '|'.join(reps.keys()) + r')\b')
    return pattern.sub(lambda x: reps[x.group()], text)

answered Apr 05 '18 at 18:33

HFBrowning

2,196
3
23
42

HDBrowing you are amazing! Thank you for this. It does solve the issue at hand and works with the simplified data version I used for this question. That said, I am still running into an issue with the more complex strings used in the actual dataset. If you would be so kind as to offer your assistance one last time on this, I would be eternally grateful :) I have added edited the question to add in your suggested code and provide additional information on the challenge of using more complex strings. – lostnthought247 Apr 05 '18 at 22:24
I do have to ask - is there a reason why your correction data are poorly formed? (This is asked totally without snark) It would be simpler to map multiple keys to the same value than it would be to bend over backwards to kludge some type of fragile string formatting/regex pattern that would work for the new case you've just described. Unless there is a good reason for it (?) I would heartily recommend doing the up-front work to properly format your correction dictionary/csv. It will make things more robust overall – HFBrowning Apr 06 '18 at 17:51
1

Haha... trust me when I say I agree. Sadly that is actually the root of my issue. Is am dealing with other parties data that they refuse to modify/change so I am trying to build a solution that allows me to modify it into workable data given the poor initial data consistency and the output formatting Reqs provided. That said, I believe a colleague of mine managed to modify the original code to work as needed. I will add this solution to my main question. Excuse my "noobness", but is there any steps needed on my end to ensure you get points for the support? – lostnthought247 Apr 06 '18 at 22:53
To give people points you can either accept the answer by selecting the check mark under their answer (+15 points, +2 to you) or upvote the answer (+10). In this case you clearly shouldn't accept my answer since it didn't solve your question. I am glad you came up with a solution though that worked for you :) – HFBrowning Apr 06 '18 at 22:59
Wel, for what its worth, l I still upvoted your answer. You did provide a solution to my initial question as it was asked.Thanks again for all the help! – lostnthought247 Apr 06 '18 at 23:01

score 0 · Answer 2 · answered Apr 06 '18 at 22:56

#!/usr/bin/env python
"""A script for finding and replacing values in CSV files.

Example::
    ./myscript school-data.csv outfile-data.csv replacements.csv 4

"""

import csv
import sys


def main(args):
    """Execute the transformation script.

    Args:
        args (list of `str`): The command line arguments.

    """
    transform(args[1], args[2], create_reps(args[3]), int(args[4]))


def transform(infile, outfile, reps, column):
    """Write a new CSV file with replaced text.

    Args:
        infile (str): the sheet of original text with errors
        outfile (str): the sheet with the revised text with corrections in
            place of errors
        reps (:obj: `str`): dictionary of error word and corrected word
        column (int): the column (0 based) the word revisions will take place
            in

    """
    with open(infile) as csvfile:
        with open(outfile, 'w') as w:
            spamreader = csv.reader(csvfile)
            spamwriter = csv.writer(w)
            for row in spamreader:
                row[column] = replace_all(row[column], reps)
                spamwriter.writerow(row)


def create_reps(infile):
    """Create reps object to use as reference dictionary for transform.

    Args:
        infile (str): The sheet of original and corrected words used to
        generate dicitonary

    Returns:
        (:obj: `str`): a dictionary listing the error words and their
        corrections

    """
    reps = {}
    with open(infile) as csvfile:
        dictreader = csv.reader(csvfile)
        for row in dictreader:
            reps[row[0]] = row[1]

    return reps


def replace_all(text, reps):
    """Iterate through `reps` and replace key => value in `text`.

    Args:
      text (str): The text to search and replace.
    reps (:obj: `str`): Search for `key` and replace with `value`

    Returns:
      (str): The string with the replacements.

    """
    last = text
    for i, j in reps.items():
        text = text.replace(i, j)
        #if last != text:
        #    return text
    return text

if __name__ == "__main__":
    main(sys.argv)

Using python dictionary to find/replace strings is csv with multiple strings to replace per cell

2 Answers2