I found this while searching around. The answer by Martijn Pieters is what I'm looking to change slightly. I would have just commented on his answer, but I don't have enough reputation.

For reference, this is his code:

import csv

with open('masterlist.csv', 'rb') as master:
    master_indices = dict((r[1], i) for i, r in enumerate(csv.reader(master)))

with open('hosts.csv', 'rb') as hosts:
    with open('results.csv', 'wb') as results:    
        reader = csv.reader(hosts)
        writer = csv.writer(results)

        writer.writerow(next(reader, []) + ['RESULTS'])

        for row in reader:
            index = master_indices.get(row[3])
            if index is not None:
                message = 'FOUND in master list (row {})'.format(index)
            else:
                message = 'NOT FOUND in master list'
            writer.writerow(row + [message])

Let's say I have a masterlist.csv that looks like:

ID,Name,Date
1234,John Smith,01/01/2020
1235,Jane Smith,01/02/2020
1236,Bob Smith,01/02/2020
1236,Bob Smith,01/05/2020

If you print master_indices (after adjusting the code to key on the first column, r[0], rather than the second, and ignoring the entry for the header row), you get:

{'1234': 1, '1235': 2, '1236': 4}

The code above does almost exactly what I need, except that it only adds Bob Smith's ID to the master_indices dictionary once, even though it's in the file twice. What do I need to change in the master_indices code so that every line is added to the dictionary, regardless of how many times its ID appears in the CSV file? I would like to get:

{'1234': 1, '1235': 2, '1236': 3,'1236': 4}

Any help is much appreciated! Thanks!

Bernardo Duarte
deimos
  • You really should not be running into a case where the ID is the same for multiple lines that you expect to be the same person/time/place/whatever. If they are not unique keys, that is an issue. A dictionary works with Key and Value pairs. The **unique** Key is used to retrieve the Value associated with it. The keys must be unique or else the entire thing would not work. – NoodleBeard Jan 09 '20 at 17:11
  • So, for my case, I do have the same IDs on multiple lines. I get these IDs from an API and pull them in. A lot of the time the API pulls the same person again on a different day. Sometimes the line is exactly the same, other times the date (or something else on that line) is different. But having the same ID in the CSV file is pretty common. I'll have to re-edit my question and maybe include a broader scope of data and what I'm trying to accomplish. This is a small piece of it. – deimos Jan 09 '20 at 17:15
  • Maybe you should use your own IDs for the rows. Otherwise you would have to use lists: `{'1234': [1], '1235': [2], '1236': [3, 4]}` – furas Jan 09 '20 at 20:02

1 Answer

A Python dictionary cannot hold duplicate keys: assigning to a key that already exists overwrites the earlier value. Instead, map each ID to a list of indices, using something like

# requires: import itertools
{id: [index for _, index in by_id]
for id, by_id in itertools.groupby(
    ((r[1], i) for i, r in enumerate(csv.reader(master))),
    key=lambda pair: pair[0]
)}

to build your master indices dictionary. This will create a dictionary that maps each ID to the list of corresponding indices.
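
As a quick check, here is a minimal sketch that runs the same groupby expression over the sample rows from the question. It assumes Python 3, reads the data from an in-memory string instead of masterlist.csv, and keys on the first column (r[0]) because that is where the sample keeps its IDs. The names SAMPLE and build_indices are made up for this illustration.

```python
import csv
import io
import itertools

# Sample rows from the question, held in memory.
# Assumption: this string stands in for the contents of masterlist.csv.
SAMPLE = """ID,Name,Date
1234,John Smith,01/01/2020
1235,Jane Smith,01/02/2020
1236,Bob Smith,01/02/2020
1236,Bob Smith,01/05/2020
"""

def build_indices(lines):
    """Map each ID (first column) to the list of row indices where it appears.

    Hypothetical helper; relies on equal IDs being adjacent in the input.
    """
    pairs = ((r[0], i) for i, r in enumerate(csv.reader(lines)))
    return {
        key: [index for _, index in group]
        for key, group in itertools.groupby(pairs, key=lambda pair: pair[0])
    }

master_indices = build_indices(io.StringIO(SAMPLE))
print(master_indices)
# {'ID': [0], '1234': [1], '1235': [2], '1236': [3, 4]}
```

The header row is counted too, so it appears as 'ID': [0]. Because every value is now a list, the lookup loop needs `if index:` or similar, and the message should format the whole list rather than a single number.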

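Do note, however, that itertools.groupby() only groups adjacent items. If your source data has rows with the same ID that are not adjacent, you will need to sort first. Sort the (ID, index) pairs rather than the rows themselves, so that each index still refers to the row's original position in the file: wrap the generator of pairs in sorted(..., key=lambda pair: pair[0]) before passing it to groupby(). Sorting the output of csv.reader(master) directly would make enumerate() number the rows in sorted order instead.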
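
If sorting is inconvenient, a different technique avoids the adjacency requirement altogether: append to a collections.defaultdict(list) in a single pass. This is not the groupby approach above, just an alternative sketch under the same assumptions (Python 3, IDs in the first column). The helper names build_indices_any_order and describe, the column parameter, and the rearranged sample rows are all made up for illustration.

```python
import csv
import io
from collections import defaultdict

def build_indices_any_order(lines, column=0):
    """Map each value in `column` to every row index where it appears.

    Hypothetical helper; works whether or not equal IDs are adjacent.
    """
    indices = defaultdict(list)
    for i, row in enumerate(csv.reader(lines)):
        if row:  # csv.reader yields [] for blank lines; skip them
            indices[row[column]].append(i)
    return dict(indices)

# Rows from the question, rearranged so the duplicate ID is not adjacent.
unsorted_rows = io.StringIO(
    "ID,Name,Date\n"
    "1236,Bob Smith,01/02/2020\n"
    "1234,John Smith,01/01/2020\n"
    "1236,Bob Smith,01/05/2020\n"
)
master_indices = build_indices_any_order(unsorted_rows)

def describe(host_id):
    """Build the RESULTS message for one ID taken from hosts.csv."""
    found = master_indices.get(host_id)
    if found:
        return 'FOUND in master list (rows {})'.format(', '.join(map(str, found)))
    return 'NOT FOUND in master list'

print(describe('1236'))  # FOUND in master list (rows 1, 3)
print(describe('9999'))  # NOT FOUND in master list
```

When reading from a real file in Python 3, open it in text mode with newline='' (as the csv module documentation recommends) rather than 'rb'.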

FallenWarrior