Given: Two csv files (1.8 MB each): AllData_1, AllData_2. Each with ~8,000 lines. Each line consists of 8 columns. [txt_0,txt_1,txt_2,txt_3,txt_4,txt_5,txt_6,txt_7,txt_8]
Goal: Based on a match of txt_0 (or, AllData_1[0] == AllData_2 ), compare the contents of the next 4 columns for these individual rows. If the data is unequal, put the entire row for each set of data in an list based on the column being different and save lists to output file. If txt_0 is one data set but not the other, then save that directly to the output file.
Example:
AllData_1 row x contains: [a1, b2, c3, d4, e5, f6, g7, h8] AllData_2 row y contains: [a1, b2, c33c, d44d, e5, f6, g7, h8]
Program saves all of row x and y in lists corresponding to ListCol2 and ListCol3. After all comparing is finished, the lists are saved to file.
How can I make my code faster or change my code to a faster algorithm?
i = 0
x0list = []
y0list = []
col1_diff = col2_diff = col3a_diff = col3b_diff = col4_diff = []
#create list out of column 0
for y in AllData_2:
y0list.append(y[0])
for entry in AllData_1:
x0list.append(entry[0])
if entry[0] not in y0list:
#code to save the line to file...
for y0 in AllData_2:
if y0[0] not in x0list:
#code to save the line to file...
for yrow in AllData_2:
i+=1
for xrow in AllData_1:
foundit = 0
if yrow[0] == xrow[0] and foundit == 0 and (yrow[1] != xrow[1] or yrow[2] != xrow[2] or yrow[3] != xrow[3] or yrow[4] != xrow[4]):
if yrow[1] != xrow[1]:
col1_diff.append(yrow)
col1_diff.append(xrow)
foundit = 1
elif yrow[2] != xrow[2]:
col2_diff.append(yrow)
col2_diff.append(xrow)
foundit = 1
elif len(yrow[3]) < len(xrow[3]):
col3a_diff.append(yrow)
col3a_diff.append(xrow)
foundit = 1
elif len(yrow[3]) >= len(xrow[3]):
col3b_diff.append(yrow)
col3b_diff.append(xrow)
foundit = 1
else:
#col4 is actually a catch-all for any other differences between lines if [0]s are equal
col4_diff.append(yrow)
col4_diff.append(xrow)
foundit = 1