Comparing the first columns in two csv files using python and printing matches

Question

I have two csv files each which contain ngrams that look like this:

drinks while strutting,4,1.435486010883783160220299732E-8
and since that,6,4.306458032651349480660899195E-8
the state face,3,2.153229016325674740330449597E-8

It's a three word phrase followed by a frequency number followed by a relative frequency number.

I want to write a script that finds the ngrams that are in both csv files, divides their relative frequencies, and prints them to a new csv file. I want it to find a match whenever the three word phrase matches a three word phrase in the other file and then divide the relative frequency of the phrase in the first csv file by the relative frequency of that same phrase in the second csv file. Then I want to print the phrase and the division of the two relative frequencies to a new csv file.

Below is as far as I've gotten. My script is comparing lines but only finds a match when the entire line (including the frequencies and relative frequencies) matches exactly. I realize that that is because I'm finding the intersection between two entire sets but I have no idea how to do this differently. Please forgive me; I'm new to coding. Any help you can give me to get a little closer would be such a big help.

import csv
import io 

alist, blist = [], []

with open("ngrams.csv", "rb") as fileA:
    reader = csv.reader(fileA, delimiter=',')
    for row in reader:
        alist.append(row)
with open("ngramstest.csv", "rb") as fileB:
    reader = csv.reader(fileB, delimiter=',')
    for row in reader:
        blist.append(row)

first_set = set(map(tuple, alist))
secnd_set = set(map(tuple, blist))

matches = set(first_set).intersection(secnd_set)

c = csv.writer(open("matchedngrams.csv", "a"))
c.writerow(matches)

print matches
print len(matches)

Michele d'Amico · Accepted Answer · 2014-12-01T23:15:08.087

This doesn't dump res into a new file (that part is tedious). The idea is that the first element of each row is the phrase and the other two are the frequencies. A dict is used instead of a set so that matching and mapping happen together.

import csv
import io 

alist, blist = [], []

with open("ngrams.csv", "rb") as fileA:
    reader = csv.reader(fileA, delimiter=',')
    for row in reader:
        alist.append(row)
with open("ngramstest.csv", "rb") as fileB:
    reader = csv.reader(fileB, delimiter=',')
    for row in reader:
        blist.append(row)

f_dict = {e[0]:e[1:] for e in alist}
s_dict = {e[0]:e[1:] for e in blist}

res = {}
for k,v in f_dict.items():
    if k in s_dict:
        res[k] = float(v[1])/float(s_dict[k][1])

print(res)
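
The answer stops at printing res. A minimal sketch of the remaining step, writing each phrase and its ratio to a new CSV file, might look like the following. It is written for Python 3 (text mode with newline=""), and the sample res values and the temporary output path are assumptions made only so the sketch runs on its own.

```python
import csv
import os
import tempfile

# Hypothetical stand-in for the `res` dict built above (phrase -> ratio).
res = {"and since that": 2.0, "the state face": 3.0}

# Assumed output location; the question used "matchedngrams.csv".
out_path = os.path.join(tempfile.mkdtemp(), "matchedngrams.csv")

# Python 3: open in text mode with newline="" so csv controls line endings.
with open(out_path, "w", newline="") as fw:
    writer = csv.writer(fw)
    for phrase, ratio in sorted(res.items()):
        writer.writerow([phrase, ratio])

# Read the file back to check what was written.
with open(out_path, newline="") as fr:
    written = list(csv.reader(fr))
```

On Python 2.7, as in the question, the file would be opened with "wb" instead.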

edited Dec 01 '14 at 23:15

answered Dec 01 '14 at 22:06


`for k,v in f_dict:` => `ValueError: too many values to unpack`, also `res[k] = v[1]/s_dict[k][1]` => `TypeError: unsupported operand type(s) for /: 'str' and 'str'` – Aprillion Dec 01 '14 at 22:56
@Aprillion I fixed it. try now – Michele d'Amico Dec 01 '14 at 22:58
for tuple unpacking you need to change `f_dict` to `f_dict.items()` – Aprillion Dec 01 '14 at 23:04
@Aprillion sorry I wrote it without testing and fast reload... thx – Michele d'Amico Dec 01 '14 at 23:08
also you repeat `in alist` 2 times and don't use `blist` at all – Aprillion Dec 01 '14 at 23:10

abarnert · Answer 2 · 2014-12-01T23:13:50.163

My script is comparing lines but only finds a match when the entire line (including the frequencies and relative frequencies) matches exactly. I realize that that is because I'm finding the intersection between two entire sets but I have no idea how to do this differently.

This is exactly what dictionaries are used for: when you have a separate key and value (or when only part of the value is the key). So:

a_dict = {row[0]: row for row in alist}
b_dict = {row[0]: row for row in blist}

Now, you can't directly use set methods on dictionaries. Python 3 offers some help here, but you're using 2.7. So, you have to write it explicitly:

matches = {key for key in a_dict if key in b_dict}

Or:

matches = set(a_dict) & set(b_dict)
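
For comparison, here is a small sketch of the Python 3 help mentioned above: dict.keys() returns a set-like view there, so the & operator works on it directly. (Python 2.7 has dict.viewkeys(), which behaves similarly.) The sample rows are made up for illustration.

```python
# Made-up sample rows keyed by phrase, in the same shape as a_dict and b_dict.
a_dict = {"and since that": ["and since that", "6", "4.3e-8"],
          "the state face": ["the state face", "3", "2.1e-8"]}
b_dict = {"the state face": ["the state face", "1", "7.1e-9"],
          "some silly ngram": ["some silly ngram", "99", "1.2e-9"]}

# Python 3: key views support set operators, so no explicit set() copy is needed.
matches = a_dict.keys() & b_dict.keys()
```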

But you really don't need the set; all you want to do here is iterate over them. So:

for key in a_dict:
    if key in b_dict:
        a_values = a_dict[key]
        b_values = b_dict[key]
        do_stuff_with(a_values[2], b_values[2])
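
As one possible reading of do_stuff_with, the sketch below divides the two relative frequencies, which is what the question asks for. The function name and the sample rows are hypothetical; note that csv hands back strings, so they need converting with float first.

```python
def ratio_of_relative_frequencies(a_rel, b_rel):
    """Hypothetical stand-in for do_stuff_with; inputs are strings from csv."""
    return float(a_rel) / float(b_rel)

# Made-up rows: [phrase, frequency, relative frequency], all strings.
a_dict = {"and since that": ["and since that", "6", "4.0e-8"]}
b_dict = {"and since that": ["and since that", "3", "2.0e-8"],
          "some silly ngram": ["some silly ngram", "99", "1.2e-9"]}

ratios = {}
for key in a_dict:
    if key in b_dict:
        # Index 2 is the relative frequency column.
        ratios[key] = ratio_of_relative_frequencies(a_dict[key][2], b_dict[key][2])
```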

As a side note, you really don't need to build up the lists in the first place just to turn them into sets, or dicts. Just build up the sets or dicts:

a_set = set()
with open("ngrams.csv", "rb") as fileA:
    reader = csv.reader(fileA, delimiter=',')
    for row in reader:
        a_set.add(tuple(row))

a_dict = {}
with open("ngrams.csv", "rb") as fileA:
    reader = csv.reader(fileA, delimiter=',')
    for row in reader:
        a_dict[row[0]] = row

Also, if you know about comprehensions, all three versions are crying out to be converted:

with open("ngrams.csv", "rb") as fileA:
    reader = csv.reader(fileA, delimiter=',')
    # Now any ONE of these (the reader can only be consumed once)
    a_list = list(reader)
    a_set = {tuple(row) for row in reader}
    a_dict = {row[0]: row for row in reader}
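
One caveat with the comprehension versions: a csv.reader is an iterator, so only one of them can be run against a given reader. The sketch below illustrates this with an in-memory file (io.StringIO, Python 3) standing in for the real one.

```python
import csv
import io

# In-memory stand-in for an open CSV file (sample rows from the question).
sample = io.StringIO(
    "drinks while strutting,4,1.435486010883783160220299732E-8\n"
    "and since that,6,4.306458032651349480660899195E-8\n"
)

reader = csv.reader(sample)
a_dict = {row[0]: row for row in reader}  # first pass consumes every row
leftover = list(reader)                   # second pass finds nothing left
```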

this is a nice lecture, but the problem was to divide relative frequencies, not sure why you want to `do_stuf_with` the absolute frequencies and I am a bit puzzled how sets are supposed to help here... — Aprillion, Dec 01 '14 at 23:01
@Aprillion: The sets are in the OP's question. The whole point is showing that he wants to use dictionaries instead of sets, and what he wants to do instead of set intersection. So I don't know why you thought the sets _were_ supposed to help, when the whole answer is about getting rid of them. — abarnert, Dec 01 '14 at 23:13

Aprillion · Answer 3 · 2014-12-01T23:08:14.860

You could store the relative frequencies from the 1st file into a dictionary, then iterate over the 2nd file and if the 1st column matches anything seen in the original file, write out the result directly to the output file:

import csv

tmp = {}

# if 1 file is much larger than the other, load the smaller one here
# make sure it will fit into the memory
with open("ngrams.csv", "rb") as fr:
    # using tuple unpacking to extract fixed number of columns from each row
    for txt, abs, rel in csv.reader(fr):
        # converting strings like "1.435486010883783160220299732E-8"
        # to float numbers
        tmp[txt] = float(rel)

with open("matchedngrams.csv", "wb") as fw:
    writer = csv.writer(fw)

    # the 2nd input file will be processed per 1 line to save memory
    # the order of items from this file will be preserved
    with open("ngramstest.csv", "rb") as fr:
        for txt, abs, rel in csv.reader(fr):
            if txt in tmp:
                # not sure what you want to do with absolute, I use 0 here:
                writer.writerow((txt, 0, tmp[txt] / float(rel)))
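
For readers on Python 3, a rough sketch of the same streaming approach follows. The main difference is that the csv module expects text-mode files opened with newline="" rather than "rb"/"wb". The temporary directory and the sample rows are assumptions so the sketch is self-contained, and it writes only the phrase and the ratio.

```python
import csv
import os
import tempfile

workdir = tempfile.mkdtemp()
first_path = os.path.join(workdir, "ngrams.csv")
second_path = os.path.join(workdir, "ngramstest.csv")
out_path = os.path.join(workdir, "matchedngrams.csv")

# Made-up sample rows so the sketch runs on its own.
with open(first_path, "w", newline="") as f:
    csv.writer(f).writerows([["and since that", 6, "4.0e-8"],
                             ["the state face", 3, "2.0e-8"]])
with open(second_path, "w", newline="") as f:
    csv.writer(f).writerows([["the state face", 1, "1.0e-8"],
                             ["some silly ngram", 99, "1.2e-9"]])

# Load relative frequencies from the first file into memory.
first_rel = {}
with open(first_path, newline="") as fr:
    for txt, _freq, rel in csv.reader(fr):
        first_rel[txt] = float(rel)

# Stream the second file and write first/second ratios for matching phrases.
with open(second_path, newline="") as fr, open(out_path, "w", newline="") as fw:
    writer = csv.writer(fw)
    for txt, _freq, rel in csv.reader(fr):
        if txt in first_rel:
            writer.writerow((txt, first_rel[txt] / float(rel)))

with open(out_path, newline="") as f:
    result = list(csv.reader(f))
```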

score 0 · Answer 4 · edited May 23 '17 at 11:58

Avoid storing very small numbers as they are: they can run into underflow problems (see "What are arithmetic underflow and overflow in C?"), and dividing one small number by another can make the problem worse. Preprocess your relative frequencies by taking their logarithms, like this:

>>> import math
>>> num = 1.435486010883783160220299732E-8
>>> logged = math.log(num)
>>> logged
-18.0591772685384
>>> math.exp(logged)
1.4354860108837844e-08

Now to the reading of the csv. Since you're only manipulating the relative frequencies, your 2nd column doesn't matter, so let's skip it and save the first column (i.e. the phrases) as the key and the third column (i.e. the relative frequency) as the value:

import csv, math

# Writes a dummy csv file as example.
textfile = """drinks while strutting, 4, 1.435486010883783160220299732E-8
and since that, 6, 4.306458032651349480660899195E-8
the state face, 3, 2.153229016325674740330449597E-8"""

textfile2 = """and since that, 3, 2.1532290163256747e-08
the state face, 1, 7.1774300544189156e-09
drinks while strutting, 2, 7.1774300544189156e-09
some silly ngram, 99, 1.235492312e-09"""

with open('ngrams-1.csv', 'w') as fout:
    for line in textfile.split('\n'):
        fout.write(line + '\n')

with open('ngrams-2.csv', 'w') as fout:
    for line in textfile2.split('\n'):
        fout.write(line + '\n')


# Read and save the two files into a dict structure

ngramfile1 = 'ngrams-1.csv'
ngramfile2 = 'ngrams-2.csv'

ngramdict1 = {}
ngramdict2 = {}

with open(ngramfile1, 'r') as fin:
    reader = csv.reader(fin, delimiter=',')
    for row in reader:
        phrase, raw, rel = row
        ngramdict1[phrase] = math.log(float(rel))

with open(ngramfile2, 'r') as fin:
    reader = csv.reader(fin, delimiter=',')
    for row in reader:
        phrase, raw, rel = row
        ngramdict2[phrase] = math.log(float(rel))

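Now for the tricky part. You want to divide the relative frequency of a phrase in one file by the relative frequency of the same phrase in the other. The code here divides ngramdict2's value by ngramdict1's; swap the operands if you want the first file divided by the second, as the question describes, i.e.: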

if phrase_from_ngramdict1 == phrase_from_ngramdict2:
  relfreq = relfreq_from_ngramdict2 / relfreq_from_ngramdict1

Since we kept the relative frequencies in logarithmic units, we don't have to divide; we simply subtract (the result is the logarithm of the ratio, so apply math.exp to it if you need the ratio itself), i.e.

if phrase_from_ngramdict1 == phrase_from_ngramdict2:
  logrelfreq = logrelfreq_from_ngramdict2 - logrelfreq_from_ngramdict1
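
As a quick check of that identity, the sketch below compares the difference of the logs with the log of the plain quotient, using one relative frequency from the question and one from the sample data in this answer. The variable names are illustrative only.

```python
import math

rel1 = 1.435486010883783160220299732E-8  # "drinks while strutting", first file
rel2 = 7.1774300544189156e-09            # same phrase, second sample file

difference_of_logs = math.log(rel2) - math.log(rel1)  # subtraction in log space
log_of_ratio = math.log(rel2 / rel1)                  # log of the plain division
```

The two values agree up to floating-point rounding, which is why subtraction can replace division here.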

And to get the phrases that occur in both files, you won't need to check them one by one. Simply cast each dictionary.keys() into a set and then do set1.intersection(set2); see https://docs.python.org/2/tutorial/datastructures.html

phrases1 = set(ngramdict1.keys())
phrases2 = set(ngramdict2.keys())
overlap_phrases = phrases1.intersection(phrases2)

print overlap_phrases

[out]:

set(['drinks while strutting', 'the state face', 'and since that'])

So now let's print it out with the relative frequencies:

with open('ngramcombined.csv', 'w') as fout:
    for p in overlap_phrases:
        relfreq1 = ngramdict1[p]
        relfreq2 = ngramdict2[p]
        combined_relfreq = relfreq2 - relfreq1
        fout.write(",".join([p, str(combined_relfreq)])+ '\n')

The ngramcombined.csv looks like this:

drinks while strutting,-0.69314718056
the state face,-1.09861228867
and since that,-0.69314718056
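
Note that these values are logarithms of the ratios, not the ratios themselves. If the plain ratio is wanted, a small sketch like the one below converts them back with math.exp; the numbers are copied from the listing above.

```python
import math

# Log-ratio values copied from the ngramcombined.csv listing.
log_ratios = {
    "drinks while strutting": -0.69314718056,
    "the state face": -1.09861228867,
    "and since that": -0.69314718056,
}

# exp() undoes log(), giving second-file / first-file relative frequency.
ratios = {phrase: math.exp(value) for phrase, value in log_ratios.items()}
```

Here "drinks while strutting" and "and since that" come out at roughly 0.5 and "the state face" at roughly one third.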

Here's the full code:

import csv, math

# Writes a dummy csv file as example.
textfile = """drinks while strutting, 4, 1.435486010883783160220299732E-8
and since that, 6, 4.306458032651349480660899195E-8
the state face, 3, 2.153229016325674740330449597E-8"""

textfile2 = """and since that, 3, 2.1532290163256747e-08
the state face, 1, 7.1774300544189156e-09
drinks while strutting, 2, 7.1774300544189156e-09
some silly ngram, 99, 1.235492312e-09"""

with open('ngrams-1.csv', 'w') as fout:
    for line in textfile.split('\n'):
        fout.write(line + '\n')

with open('ngrams-2.csv', 'w') as fout:
    for line in textfile2.split('\n'):
        fout.write(line + '\n')


# Read and save the two files into a dict structure

ngramfile1 = 'ngrams-1.csv'
ngramfile2 = 'ngrams-2.csv'

ngramdict1 = {}
ngramdict2 = {}

with open(ngramfile1, 'r') as fin:
    reader = csv.reader(fin, delimiter=',')
    for row in reader:
        phrase, raw, rel = row
        ngramdict1[phrase] = math.log(float(rel))

with open(ngramfile2, 'r') as fin:
    reader = csv.reader(fin, delimiter=',')
    for row in reader:
        phrase, raw, rel = row
        ngramdict2[phrase] = math.log(float(rel))


# Find the intersecting phrases.
phrases1 = set(ngramdict1.keys())
phrases2 = set(ngramdict2.keys())
overlap_phrases = phrases1.intersection(phrases2)

# Output to new file.
with open('ngramcombined.csv', 'w') as fout:
    for p in overlap_phrases:
        relfreq1 = ngramdict1[p]
        relfreq2 = ngramdict2[p]
        combined_relfreq = relfreq2 - relfreq1
        fout.write(",".join([p, str(combined_relfreq)])+ '\n')

If you like SUPER UNREADABLE but short code (in no. of lines):

import csv, math
# Read and save the two files into a dict structure
ngramfile1 = 'ngrams-1.csv'
ngramfile2 = 'ngrams-2.csv'

ngramdict1 = {row[0]:math.log(float(row[2])) for row in csv.reader(open(ngramfile1, 'r'), delimiter=',')}
ngramdict2 = {row[0]:math.log(float(row[2])) for row in csv.reader(open(ngramfile2, 'r'), delimiter=',')}

# Find the intersecting phrases.
overlap_phrases = set(ngramdict1.keys()).intersection(set(ngramdict2.keys()))

# Output to new file.
with open('ngramcombined.csv', 'w') as fout:
    for p in overlap_phrases:
        fout.write(",".join([p, str(ngramdict2[p] - ngramdict1[p])])+ '\n')
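
The short version leaves both input files to be closed by the garbage collector. A possible middle ground, sketched here for Python 3, keeps the dict comprehension but wraps it in a small helper that closes the file. The helper name load_log_relfreqs, the temporary paths and the sample rows are all assumptions for illustration.

```python
import csv
import math
import os
import tempfile

def load_log_relfreqs(path):
    """Hypothetical helper: map each phrase to the log of its relative frequency."""
    with open(path, newline="") as fin:  # the file is closed when the block exits
        return {row[0]: math.log(float(row[2])) for row in csv.reader(fin)}

# Made-up sample files so the sketch runs on its own.
workdir = tempfile.mkdtemp()
path1 = os.path.join(workdir, "ngrams-1.csv")
path2 = os.path.join(workdir, "ngrams-2.csv")
with open(path1, "w") as f:
    f.write("and since that, 6, 4.0e-8\nthe state face, 3, 2.0e-8\n")
with open(path2, "w") as f:
    f.write("and since that, 3, 2.0e-8\nsome silly ngram, 99, 1.2e-9\n")

d1 = load_log_relfreqs(path1)
d2 = load_log_relfreqs(path2)

# Log of (second file / first file) for phrases present in both.
log_ratios = {p: d2[p] - d1[p] for p in d1.keys() & d2.keys()}
```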
