1

Given: Two csv files (1.8 MB each): AllData_1, AllData_2. Each with ~8,000 lines. Each line consists of 8 columns. [txt_0,txt_1,txt_2,txt_3,txt_4,txt_5,txt_6,txt_7,txt_8]

Goal: Based on a match of txt_0 (or, AllData_1[0] == AllData_2 ), compare the contents of the next 4 columns for these individual rows. If the data is unequal, put the entire row for each set of data in an list based on the column being different and save lists to output file. If txt_0 is one data set but not the other, then save that directly to the output file.

Example:

AllData_1 row x contains: [a1, b2, c3, d4, e5, f6, g7, h8] AllData_2 row y contains: [a1, b2, c33c, d44d, e5, f6, g7, h8]

Program saves all of row x and y in lists corresponding to ListCol2 and ListCol3. After all comparing is finished, the lists are saved to file.

How can I make my code faster or change my code to a faster algorithm?

i = 0
x0list = []
y0list = []
col1_diff = col2_diff = col3a_diff = col3b_diff = col4_diff = []

#create list out of column 0
for y in AllData_2:
    y0list.append(y[0])

for entry in AllData_1:
    x0list.append(entry[0])
    if entry[0] not in y0list:
        #code to save the line to file...

for y0 in AllData_2:
    if y0[0] not in x0list:
        #code to save the line to file...

for yrow in AllData_2:
    i+=1

    for xrow in AllData_1:
        foundit = 0
        if yrow[0] == xrow[0] and foundit == 0 and (yrow[1] != xrow[1] or yrow[2] != xrow[2] or yrow[3] != xrow[3] or yrow[4] != xrow[4]):
            if yrow[1] != xrow[1]:
                col1_diff.append(yrow)
                col1_diff.append(xrow)
                foundit = 1

            elif yrow[2] != xrow[2]:
                col2_diff.append(yrow)
                col2_diff.append(xrow)
                foundit = 1

            elif len(yrow[3]) < len(xrow[3]):
                col3a_diff.append(yrow)
                col3a_diff.append(xrow)
                foundit = 1

            elif len(yrow[3]) >= len(xrow[3]):
                col3b_diff.append(yrow)
                col3b_diff.append(xrow)
                foundit = 1

            else:
                #col4 is actually a catch-all for any other differences between lines if [0]s are equal
                col4_diff.append(yrow)
                col4_diff.append(xrow)
                foundit = 1
Mark
  • 81
  • 1
  • 2
  • 6
  • maybe this can help http://stackoverflow.com/questions/744256/reading-huge-file-in-python – Mo J. Mughrabi Aug 14 '11 at 00:56
  • 1
    Use sets not lists for membership testing. – Katriel Aug 14 '11 at 01:06
  • What is `foundit == 0` supposed to do? It's always true because `foundit` is set to 0 just before the test. – Mike Samuel Aug 15 '11 at 06:37
  • I had hoped foundit would speed the comparisons, but changing the code and not thinking about what it does caused these useless comparisons. After removing the foundit comparisons, the program time decreased ~25%. – Mark Aug 16 '11 at 03:46

2 Answers2

1

If you can expect no two lines in a given file to have the same data in column 0, you can significantly improve your code with a few dicts. Instead of the lines

x0list.append(entry[0])
y0list.append(y[0])

You would use:

x0dict[entry[0]] = entry
y0dict[y[0]] = y

after initializing x0dict and y0dict to {}. Then, instead of looping through both complete sets of data again, you can loop over just one of the dicts:

for x0, xrow in x0dict:
    if x0 in y0dict:
        yrow = y0dict[x0]
        # Do the col{1,2,3,4}_diff stuff here

As a bonus, the not in in your second and third loops works the same.


The line

(yrow[1] != xrow[1] or yrow[2] != xrow[2] or yrow[3] != xrow[3] or yrow[4] != xrow[4])

can be replaced with the nicer-looking

yrow[1:5] != xrow[1:5]

As your code stands right now, i is never used, but if you need that count, it ends up being identical to just saying i = len(AllData_2), since it only increments once per run in a loop over AllData_2.


Finally, your foundit variable currently serves no purpose. It is only used to control the flow with foundit == 0, immediately after setting it to 0, so that will always evaluate to True and setting it has no effect.

Aaron Dufour
  • 17,288
  • 1
  • 47
  • 69
  • I don't see why you can't just compare `yrow[1:5]` and `xrow[1:5]` directly? Slices are sequences, which get compared element-wise... – Karl Knechtel Aug 14 '11 at 04:33
  • @Karl Oops, current project is in coffeescript and I'm used to that not working correctly. Edited. – Aaron Dufour Aug 14 '11 at 15:37
  • Thank you much for your response. I have a few follow-up questions. For the majority of the comparisons, I'm looking for the lines with the first column matches. But, I took care of the unnecessary i (not too much speed gain) and the foundit comparison (~25% speed gain). Thanks again! – Mark Aug 15 '11 at 04:15
1

Right of the top, you can make this a lot smaller.

y0list = []
for y in AllData_2:
    y0list.append(y[0])

is just a verbose way of saying

y0list = [y[0] for y in AllData_2]

And you can use in builtin comparisons. The below

(yrow[1] != xrow[1] or yrow[2] != xrow[2] or yrow[3] != xrow[3] or yrow[4] != xrow[4])

can be expressed as

yrow[1:] != xrow[1:]

which is much less prone to copy/paste errors.

To make it faster, you can avoid doing O(n**2) comparisons. Since you only care when the first column element is the same, you can just bundle them by first element.

index = {}
for yrow in AllData_2:
    key = yrow[0]
    list = index.get(key)
    if list is None:
        list = []
        index[key] = list
    list.append(yrow)

for xrow in AllData_1:
    list = index.get(xrow[0])
    if list is None: continue
    for yrow in list:
        # Do all your comparison here
Mike Samuel
  • 118,113
  • 30
  • 216
  • 245
  • Thanks a lot for your help, especially the O(n**2) code. The bundles make work great, except the two initial comparisons are missing now. How can x[0] not in all_y and y[0] not in all_x (different code than original above) be added? – Mark Aug 15 '11 at 04:30