-1

I have a 13 column csv file I am trying to expand in a particular way. My columns are:

firstName, firstName2, lastName, lastName2, location1, location2, location3, location4, email, email2, phone, phone2, phone3

The data is not perfectly clean (as one can imagine when it comes to people's names) and I want to expand the data for each entry into a large number of possible combinations of the data. Not exactly every possible combination of the data, but close to it. Not every entry has all the data (in fact I don't think any of the rows contain data for every column).

Is the best way to do this really a super-nested multi-branch structure? Essentially right now I'm starting with col1 and testing for a value, then testing for col2, adding each combination to a list of dictionaries and then appending these values to the master list of dictionaries.

For example on dictionary1

  • firstName, lastName, location1, email, phone
  • firstName, lastName, location1, email, phone2
  • ...
  • firstName, lastName, location2, email, phone
  • ...
  • firstNamefirstName2(combined), lastName, location, email, phone

Each row will become like 36 rows (honestly don't know, never been very good at combinatorial math, especially with conditionals).

Is there anything I can use to make this more straightforward? A library or something?

Update: The actual combinatorial algorithm is this:

fn      ln      loc email phones
fn      lnln2   loc email phones
fn      ln2     loc email phones
fnfn2   ln      loc email phones
fnfn2   lnln2   loc email phones
fnfn2   ln2     loc email phones
fn2     ln      loc email phones
fn2     lnln2   loc email phones
fn2     ln2     loc email phones

Where 4 locations, 2 emails, and 3 phones expand.

And I don't want redundancy of empty values. I figure that would be easier to just delete duplications after the csv file is made (that's simple in Excel).

martineau
  • 119,623
  • 25
  • 170
  • 301
mas
  • 1,155
  • 1
  • 11
  • 28
  • There are up to `2 ** 13 - 1 = 8191` non-empty combinations that can arise from each line if the order of columns matters. You might get better answers if you tell us where you are going with this. – hilberts_drinking_problem May 04 '18 at 20:50
  • 1
    I am trying to do some targetted facebook advertising using their custom audience option. You can upload a csv file of various customer data and they can match it to users and target those specific customers. The problem is not all of my data is perfect. I have up to 4 possible resident locations for each customer, multiple phone numbers, and facebook doesn't take multiple inputs for each field. Therefore I want to expand the number of rows with different combinations of each customer's data. – mas May 04 '18 at 20:52
  • I don't think your approach is viable. You need some kind of "combinatorial algorithm" that's smart enough to not produce combinations that make no "sense", which implies at least recognizing what _kind_ data every field contains—which sounds very difficult and error-prone at best. Suggest you think of something else... – martineau May 04 '18 at 21:52
  • I think I've got it. I abandoned my previous approach and ended up with 9 dictionary creations appended to a list on a for loop. Assuming all goes well, I'll post the solution later. I will still end up with duplicates and will have to eliminate them, but that should be trivial. Apparently you can't make a set of dictionaries to guarantee uniqueness, but even excel can eliminate duplicate rows. (Does anyone else end up trying to hit escape on so.com after they've been using vim for 20+minutes?) – mas May 04 '18 at 21:55

1 Answers1

0

The solution was just to create a 3-nested for loop and perform nine dictionary creation statements. This resulted in 216 entries for each row, which then meant I could eliminate duplicates with a simple script I copied from stackoverflow (Removing duplicate rows from a csv file using a python script)

There were still redundancies. For instance, when there was no firstname2 appending the empty firstname2 added a space, and my duplicate finder recognized firstname and firstname+' ' as being separate values and retained the dupes. I don't mind, though, that doesn't matter for my purposes. I just didn't want the redundancy of 216 values for each.

final = [{}]

for agent in agents:

   #create a list of dictionaries for this agent
   finalagents = [{}]
   for i in range(1,4):
    for j in range(1,3):
        for k in range(1,2):
            city = 'city' + str(i)
            state = 'state' + str(i)
            zipcode = 'zip' + str(i)
            phone = 'phone' + str(j)
            email = 'email' + str(k)
            fnfn1 = agent['fn'] + ' ' + agent['fn1']
            lnln1 = agent['ln'] + ' ' + agent['ln1']

            #fn ln
            finalagents.append({'fn': agent['fn'], 'ln': agent['ln'], 'city': agent[city], 'state': agent[state],
                'zip': agent[zipcode], 'phone': agent[phone], 'email': agent[email]}) 
            #fn lnln1 
            finalagents.append({'fn': agent['fn'], 'ln': lnln1, 'city': agent[city], 'state': agent[state],
                'zip': agent[zipcode], 'phone': agent[phone], 'email': agent[email]}) 
            #fn ln1
            finalagents.append({'fn': agent['fn'], 'ln': agent['ln1'], 'city': agent[city], 'state': agent[state],
                'zip': agent[zipcode], 'phone': agent[phone], 'email': agent[email]}) 
            #fnfn1 ln 
            finalagents.append({'fn': fnfn1, 'ln': agent['ln'], 'city': agent[city], 'state': agent[state],
                'zip': agent[zipcode], 'phone': agent[phone], 'email': agent[email]}) 
            #fnfn1 lnln1
            finalagents.append({'fn': fnfn1, 'ln': lnln1, 'city': agent[city], 'state': agent[state],
                'zip': agent[zipcode], 'phone': agent[phone], 'email': agent[email]}) 
            #fnfn1 ln1
            finalagents.append({'fn': fnfn1, 'ln': agent['ln1'], 'city': agent[city], 'state': agent[state],
                'zip': agent[zipcode], 'phone': agent[phone], 'email': agent[email]}) 
            #fn1 ln
            finalagents.append({'fn': agent['fn1'], 'ln': agent['ln'], 'city': agent[city], 'state': agent[state],
                'zip': agent[zipcode], 'phone': agent[phone], 'email': agent[email]}) 
            #fn1 lnln1
            finalagents.append({'fn': agent['fn1'], 'ln': lnln1, 'city': agent[city], 'state': agent[state],
                'zip': agent[zipcode], 'phone': agent[phone], 'email': agent[email]}) 
            #fn1 ln1
            finalagents.append({'fn': agent['fn1'], 'ln': agent['ln1'], 'city': agent[city], 'state': agent[state],
                'zip': agent[zipcode], 'phone': agent[phone], 'email': agent[email]}) 

#add the finalagents list of dictionaries to the official final list of dictionaries in an iterable for loop
for dictionary in finalagents:
    final.append(dictionary)
mas
  • 1,155
  • 1
  • 11
  • 28