
I have a very large array with many rows and many columns (called "self.csvFileArray") that is composed of rows that I've read from a CSV file, using the following code in a class that deals with CSV files...

import csv

with open(self.nounDef["Noun Source File Name"], 'rU') as csvFile:
    for idx, row in enumerate(csv.reader(csvFile, delimiter=',')):
        if idx == 0:
            self.csvHeader = row  # remember the header row separately...
        self.csvFileArray.append(row)  # ...but keep it in the array as well

I have a very long dictionary of replacement mappings that I'd like to use for replacements...

replacements = {"str1a":"str1b", "str2a":"str2b", "str3a":"str3b", etc.}

I'd like to do this in a class method that looks as follows...

def m_globalSearchAndReplace(self, replacements):
    # apply the replacements dictionary to self.csvFileArray...
    pass

MY QUESTION: What is the most efficient way to replace strings throughout the array "self.csvFileArray", using the "replacements" dictionary?

NOTES FOR CLARIFICATION:

  1. I took a look at this post but can't seem to get it to work for this case.

  2. Also, I want to replace strings within words that match, not just entire words. So, working with a replacement mapping of "SomeCompanyName":"xyz", I may have a sentence like "The company SomeCompanyName has a patent for a product called abcSomeCompanyNamedef." You'll notice that the string has to be replaced twice in the sentence: once as a whole word and once as an embedded string (see the snippet below).
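
For illustration, plain str.replace() already behaves this way, substituting every occurrence whether whole word or embedded:

sentence = ("The company SomeCompanyName has a patent for a product "
            "called abcSomeCompanyNamedef.")
# Both the whole-word and the embedded occurrence are replaced.
print(sentence.replace("SomeCompanyName", "xyz"))
# The company xyz has a patent for a product called abcxyzdef.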

Information Technology
  • could you please add a sample of your array? – MattR Oct 20 '17 at 20:41
  • what is the final purpose for `self.csvFileArray`? should all the rows be saved to a new file? – RomanPerekhrest Oct 20 '17 at 20:42
  • The self.csvFileArray represents all the rows that were read in from an original CSV file. We're building a "smart scrubber" that cleans and transforms the data by stripping out confidential data in a way that does not lose "key-integrity", before it can be written back out to a new CSV file, which can be sent to vendors to work with. – Information Technology Oct 20 '17 at 20:49
  • @MattR... the original CSV is too large. There are over 300 columns and over 1M rows. Each row represents a person; each column a descriptive trait. Some are very basic (First Name, Last Name, Age, etc.), some are financial and health info, and some are paragraphs that provide multi-line comments. – Information Technology Oct 20 '17 at 21:05

2 Answers


The following works with the above and has been fully tested...

  def m_globalSearchAndReplace(self, dataMap):
      replacements = dataMap.m_getMappingDictionary()
      keys = replacements.keys()
      for row in self.csvFileArray:  # loop through each row/list
          for idx, w in enumerate(row):  # loop through each word in the row/list
              for key in keys:  # for every key in the dictionary...
                  if key not in ('NULL', '-', '.', ''):  # skip placeholder keys
                      w = w.replace(key, replacements[key])
              row[idx] = w

  1. In short, loop through every row in the csvFileArray and get every word.

  2. Then, for every word in the row, loop through the dictionary's (called "replacements") keys to access and apply each mapping.

  3. Then (assuming the right conditions) replace the value with its mapped value (in the dictionary).

NOTE: While it works, I don't believe that this triple-nested loop is the most efficient way to solve the problem, and I believe there has to be a better way, using regular expressions. So, I'll leave this open for a bit to see if anyone can improve on the answer.
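
For example, here is a sketch of that regex idea, reusing the same dataMap.m_getMappingDictionary() interface as the code above (the longest-first sort is an assumption, added so a short key never matches inside a longer key that contains it):

import re

def m_globalSearchAndReplace(self, dataMap):
    replacements = dataMap.m_getMappingDictionary()
    # Drop the placeholder keys, then sort longest-first so a longer key
    # is preferred over any shorter key that is a prefix of it.
    keys = sorted((k for k in replacements if k not in ('NULL', '-', '.', '')),
                  key=len, reverse=True)
    # One compiled alternation of all the escaped keys.
    pattern = re.compile('|'.join(re.escape(k) for k in keys))
    for row in self.csvFileArray:
        for idx, cell in enumerate(row):
            # Scan each cell once; look up each match's replacement.
            row[idx] = pattern.sub(lambda m: replacements[m.group(0)], cell)

This scans each cell once instead of once per key, though whether it actually beats the nested str.replace() loop is worth measuring on real data.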

Information Technology
  • A regular expression would also have to search through the entire text, so performance isn't going to be great. Also, matching a regex pattern is slower than a string comparison... – errantlinguist Oct 24 '17 at 01:38
  • I may be able to get something together but unfortunately it will take more time than I have just at the moment... – errantlinguist Oct 24 '17 at 01:41

In a big loop? You could just load the CSV file as one string, so you only have to go through the text once instead of once for every item; something like the sketch below. Though it's not really very efficient, since Python strings are immutable, you're still facing the same problem either way.
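
A minimal sketch of that whole-file-as-one-string approach (old_file, new_file, and replacements are hypothetical names, matching the line-by-line snippet further down):

# Read the entire file into a single string, apply every replacement
# once over the whole text, then write the result back out.
with open(old_file, 'r') as f:
    text = f.read()
for old, new in replacements.items():
    text = text.replace(old, new)
with open(new_file, 'w') as f:
    f.write(text)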

According to this answer, Optimizing find and replace over large files in Python (re the efficiency), maybe going line by line would work better, so you don't have the giant string in memory if that actually becomes a problem.

edit: So something like this...

# open original and new file.
with open(old_file, 'r') as old_f, open(new_file, 'w') as new_f:
    # loop through each line of the original file (old file)
    for old_line in old_f:
        new_line = old_line
        # loop through your dictionary of replacements and make them.
        for r in replacements:
            new_line = new_line.replace(r, replacements[r])
        # write each line to the new file.
        new_f.write(new_line)

Anyway, I would forget that the file is a CSV file and just treat it as a big collection of lines or characters.

simsosims