Compare 2 .CSV with unknown number of columns and names

Question

and thanks in advance for any advice. First-time poster here, so I'll do my best to put in all required info. I am also quite beginner with Python, have been doing some online tutorials, and some copy/paste coding from StackOverflow, it's FrankenCoding... So I'm probably approaching this wrong...

I need to compare two CSV files, that will have a changing number of columns, there will only ever be 2 columns that match (for example, email_address in one file, and EMAIL in the other). Both files will have headers, however the names of these headers may change. The file sizes may be anywhere from a few thousand lines up to +2,000,000, with potentially 100+ columns (but more likely to have a handful).

Output is to a third 'results.csv' file, containing all the info. It may be a merge (all unique entries), a substract (remove entries present in one or the other) or an intersect (all entries present in both).

I have searched here, and found a lot of good information, but all of the ones I saw had a fixed number of columns in the files. I've tried dict and dictreader, and I know the answer is in there somewhere, but right now, I'm a bit confused. But since I haven't made any progress in several days, and I can only devote so much time on this, I'm hoping that I can get a nudge in the right direction.

Ideally, I want to learn how to do it myself, which means understanding how the data is 'moving around'.

Extract of CSV files below, I didn't add more columns then (I think) necessary, the dataset I have now will match on Originalid/UID or emailaddress/email, but this may not always be the case.

Original.csv

"originalid","emailaddress",""
"12345678","Bob@mail.com",""
"23456789","NORMA@EMAIL.COM",""
"34567890","HENRY@some-mail.com",""
"45678901","Analisa@sports.com",""
"56789012","greta@mail.org",""
"67890123","STEVEN@EMAIL.ORG",""

Compare.CSV

"email","","DATEOFINVALIDATION_WITH_TIME","OPTOUTDATE_WITH_TIME","EMAIL_USERS"
"Bob@mail.com",,,"true"
"NORMA@EMAIL.COM",,,"true"
"HENRY@some-mail.com",,,"true"
"Henrietta@AWESOME.CA",,,"true"
"NORMAN@sports.CA",,,"true"
"albertina@justemail.CA",,,"true"

Data in results.csv should be all columns from Original.CSV + all columns in Compare.csv, but not the matching one (email) :

"originalid","emailaddress","","DATEOFINVALIDATION_WITH_TIME","OPTOUTDATE_WITH_TIME","EMAIL_USERS"
"12345678","Bob@mail.com","",,,"true"
"23456789","NORMA@EMAIL.COM","",,,"true"
"34567890","HENRY@some-mail.com","",,,"true"

Here are my results as they are now:

email,,DATEOFINVALIDATION_WITH_TIME,OPTOUTDATE_WITH_TIME,EMAIL_USERS
Bob@mail.com,,,true,"['12345678', 'Bob@mail.com', '']"
NORMA@EMAIL.COM,,,true,"['23456789', 'NORMA@EMAIL.COM', '']"
HENRY@some-mail.com,,,true,"['34567890', 'HENRY@some-mail.com', '']"

And here's where I'm at with the code, the print statement returns matching data from the files to screen but not to file, so I'm missing something in there.
***** And I'm not getting the headers from the original.csv file, data is coming in.

import csv

def get_column_from_file(filename, column_name):
    f = open(filename, 'r')
    reader = csv.reader(f)
    headers = next(reader, None)
    i = 0
    max = (len(headers))
    while i < max:
        if headers[i] == column_name:
            column_header = i
 #       print(headers[i])
        i = i + 1
    return(column_header)

file_to_check = "Original.csv"
file_console = "Compare.csv"

column_to_read = get_column_from_file(file_console, 'email')
column_to_compare = get_column_from_file(file_to_check, 'emailaddress')

with open(file_console, 'r') as master:
    master_indices = dict((r[1], r) for i, r in enumerate(csv.reader(master)))

with open('Compare.csv', 'r') as hosts:
    with open('results.csv', 'w', newline='') as results:
        reader = csv.reader(hosts)
        writer = csv.writer(results)

        writer.writerow(next(reader, []))

        for row in reader:
            index = master_indices.get(row[0])
            if index is not None:
                print (row +[master_indices.get(row[0])])
                writer.writerow(row +[master_indices.get(row[0])])

Thanks for your time!

Pat

:-) Use the print statement to confirm that I'm getting the data I want, I forgot to remove that #. However, 'writer' it is still returning less data to the results file then the print statement is returning to screen. Thanks! — Padiwan apprentice, Apr 15 '19 at 19:52

score 0 · Answer 1 · answered Apr 15 '19 at 19:51

0

Right now it looks like you only use writerow once for the header:

writer.writerow(next(reader, []))

As francisco pointed out, uncommenting that last line may fix your problem. You can do this by removing the "#" at the beginning of the line.

answered Apr 15 '19 at 19:51

Evan

2,120
1
15
20

Thanks! I've removed the #. Please see my reply to Francisco. – Padiwan apprentice Apr 15 '19 at 19:54

Tom Lubenow · Accepted Answer · 2019-04-16T20:43:04.287

I like that you want to do this yourself, and recognize a need to "understand how the data is moving around." This is exactly how you should be thinking of the problem: focusing on the movement of data rather than the result. Some people may disagree with me, but I think this is a good philosophy to follow as it will make future reuse easier.

You're not trying to build a tool that combines two CSVs, you're trying to organize data (that happens to come from a CSV) according to a common reference (email address) and output the result as a CSV. Because you are talking about potentially large data sets (+2,000,000 [rows] with potentially 100+ columns) recognize that it is important to pay attention to the asymptotic runtime. If you do not know what this is, I recommend you read up on Big-O notation and asymptotic algorithm analysis. You might be okay without this.

First you decide what, from each CSV, is your key. You've already done this, 'email' for 'Compare.csv' and 'emailaddress' from 'Original.csv'. Now, build yourself a function to produce dictionaries from the CSV based off the key.

def get_dict_from_csv(path_to_csv, key):
    with open(path_to_csv, 'r') as f:
        reader = csv.reader(f)
        headers, *rest = reader  # requires python3
    key_index = headers.index(key)  # find index of key
    # dictionary comprehensions are your friend, just think about what you want the dict to look like
    d = {row[key_index]: row[:key_index] + row[key_index+1:]  # +1 to skip the email entry
         for row in rest}
    headers.remove(key)
    d['HEADERS'] = headers  # add headers so you know what the information in the dict is
    return d

Now you can call this function on both of your CSVs.

file_console_dict = get_dict_from_csv('Compare.csv', 'email')
file_to_check_dict = get_dict_from_csv('Original.csv', 'emailaddress')

Now you have two dicts which are keyed off the same information. Now we need a function to combine these into one dict.

def combine_dicts(*dicts):
    d, *rest = dicts  # requires python3
    # iteratively pull other dicts into the first one, d
    for r in rest:
        original_headers = d['HEADERS'][:]
        new_headers = r['HEADERS'][:]
        # copy headers
        d['HEADERS'].extend(new_headers)
        # find missing keys
        s = set(d.keys()) - set(r.keys())  # keys present in d but not in r
        for k in s:
            d[k].extend(['', ] * len(new_headers))
        del r['HEADERS']  # we don't want to copy this a second time in the loop below
        for k, v in r.items():
            # use setdefault in case the key didn't exist in the first dict
            d.setdefault(k, ['', ] * len(original_headers)).extend(v)
    return d

Now you have one dict which has all the information you want, all you need to do is write it back as a CSV.

def write_dict_to_csv(output_file, d, include_key=False):
    with open(output_file, 'w', newline='') as results:
        writer = csv.writer(results)
        # email isn't in your HEADERS, so you'll need to add it
        if include_key:
            headers = ['email',] + d['HEADERS']
        else:
            headers = d['HEADERS']
        writer.writerow(headers)
        # now remove it from the dict so we can iterate over it without including it twice
        del d['HEADERS']
        for k, v in d.items():
            if include_key:
                row = [k,] + v
            else:
                row = v
            writer.writerow(row)

And that should be it. To call all of this is just

file_console_dict = get_dict_from_csv('Compare.csv', 'email')
file_to_check_dict = get_dict_from_csv('Original.csv', 'emailaddress')
results_dict = combine_dicts(file_to_check_dict, file_console_dict)
write_dict_to_csv('results.csv', results_dict)

And you can easily see how this can be extended to arbitrarily many dictionaries.

You said you didn't want the email to be in the final CSV. This is counter-intuitive to me, so I made it an option in write_dict_to_csv() in case you change your mind.

When I run all the above I get

email,originalid,,,DATEOFINVALIDATION_WITH_TIME,OPTOUTDATE_WITH_TIME,EMAIL_USERS
Bob@mail.com,12345678,,,,true
NORMA@EMAIL.COM,23456789,,,,true
HENRY@some-mail.com,34567890,,,,true
Analisa@sports.com,45678901,,,,,
greta@mail.org,56789012,,,,,
STEVEN@EMAIL.ORG,67890123,,,,,
Henrietta@AWESOME.CA,,,,,true
NORMAN@sports.CA,,,,,true
albertina@justemail.CA,,,,,true

Oh man! That looks awesome, and has plenty for me to sink my teeth into! Don't have time to test it out today, but tomorrow first thing! Thanks a whole bunch Tom! — Padiwan apprentice, Apr 15 '19 at 21:02
OK, I'm working on this and this is as awesome as I initially though! However, I will need one of the 2 email fields in my results. I've tried to do it myself, and I get all the HEADERS, but not their corresponding data, I'm guessing this is related to the dict index? Also, this line has me scratching my head: d = {row[key_index]: row[:key_index] + row[key_index+1:] for row in rest}. I get that it's k:v, but what's happening on the v side? Thanks a million! — Padiwan apprentice, Apr 16 '19 at 15:17
If you want to include the email in the CSV (which I suspected you would want) just call write_dict_to_csv() with include_key=True. Are you familiar with optional parameters in Python? Also, the line you call out would be confusing to someone not familiar with Python's expressive syntax. It's called a [dictionary comprehension](https://www.python.org/dev/peps/pep-0274/). There are also list comprehensions and generator comprehensions which have similar syntax. I could have written the exact same thing as - for row in rest: d.update({row[key_index] : row[:key_index] + row[key_index+1:]}). — Tom Lubenow, Apr 16 '19 at 16:07
If you are confused about the list splicing, I'm trying to assign the value to the email such that it skips over the email. I know what index the email is at in the row, it's at key_index. Python's list splicing includes the lower bound, but does not include the upper bound (for example, range(5)[0:3] == [0, 1, 2]), so we include the first half of the row by saying row[:key_index] and the second half of the row by doing row[key_index+1:], skipping over key_index with the plus 1. In python, adding two lists together simply combines them as you would expect visually. — Tom Lubenow, Apr 16 '19 at 16:13
Thanks! I've set that flag to True and I'm getting all the HEADERS but not all the data. I'm getting 'email' twice (, and I have the 'emailaddress' header but not the data for that column. — Padiwan apprentice, Apr 16 '19 at 16:27
I'm not sure, is it possible that you edited the code somewhat? If I copy and paste exactly what I have above, I get the output that I just added to my answer. Let me know if this isn't the output you're expecting. I essentially did your "merge" strategy, where I included anything that was present in any CSV. — Tom Lubenow, Apr 16 '19 at 16:56
I re-pasted everything, and changed include_key=True. I get the same results as you. However, there are two ',' missing in the results where there no 'originalid' values. So where there is data in both files, data lines up to the columns, but when there isn't the necessary ',' are not being added to keep things aligned, I think. I open the file in CSVd (or Excel), it makes it very easy to spot. I really appreciate you taking the time to help out with this! — Padiwan apprentice, Apr 16 '19 at 17:57
Ya, I get what you want. It may be possible to just change combine_dicts() to accomplish this, however I think it may be simpler to go a different route. Instead of having the CSV dictionaries rely on a list order, we'll assign each email key to a dictionary value, which will contain all the other fields of the CSV. This way, we preserve a key:value mapping which doesn't rely upon order. We'll then need a [recursive dictionary update method](https://stackoverflow.com/questions/3232943/update-value-of-a-nested-dictionary-of-varying-depth), and a call to setdefault(k, '') for missing fields. — Tom Lubenow, Apr 16 '19 at 19:31
You may have may confused me with someone that knows more about Python... :-) This would get done in 'get_dict', right? Would csv.DictReader be appropriate here? — Padiwan apprentice, Apr 16 '19 at 20:06
Disregard my last idea, I believe I have accomplished what you want just by changing the behavior of combine_dicts(). That function is more complicated now, but it now handles keys that exist in one dictionary but not the other, while preserving order properly. I have updated my answer to include this. — Tom Lubenow, Apr 16 '19 at 20:45
Yup, that does it! Just ran it on a normal set (1.5mil and 18k rows respectively) and returns exactly what I want, and completed in about 10 seconds! Thanks again, you win the internet! — Padiwan apprentice, Apr 16 '19 at 20:52

Compare 2 .CSV with unknown number of columns and names

2 Answers2