I'm reading a large file where each row has 20 numbers. I want to end up with a 2D array where each row is a unique row from the file, and, for each row, a count of how many times it appeared in the file.
So far I did this by building rowDB, a list of lists (each sub-list is one 20-number row from the file), plus a parallel list that counts how many times each row appeared:
[uniq, idx] = is_unique(rowDB, new_row)
if uniq:
    rowDB.append(new_row)
    num_of_occurances.append(1)
else:
    num_of_occurances[idx] += 1
I created this helper function, which checks whether new_row is unique, i.e. does not already exist in rowDB. It returns uniq = True/False, and if False it also returns the index of the matching row in rowDB:
def is_unique(rowDB, new_row):
    # Return [False, i] if new_row already exists at index i in rowDB,
    # otherwise return [True, 0].
    for i in range(len(rowDB)):
        row_i = rowDB[i]
        equal = True
        for j in range(len(row_i)):
            if row_i[j] != new_row[j]:
                equal = False
                break
        if equal:
            return [False, i]
    return [True, 0]
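For completeness, this is roughly how I drive it over the file (the file name and the line parsing below are just placeholders for what I actually do, not my exact code):

rowDB = []
num_of_occurances = []
with open("data.txt") as f:                        # placeholder file name
    for line in f:
        new_row = [int(x) for x in line.split()]   # assuming 20 whitespace-separated numbers per line
        [uniq, idx] = is_unique(rowDB, new_row)
        if uniq:
            rowDB.append(new_row)
            num_of_occurances.append(1)
        else:
            num_of_occurances[idx] += 1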
However, when the DB is large this takes a lot of time. So my question is: what is the most efficient way to do this? Maybe using a numpy array instead of lists? And if so, is there a built-in numpy function that checks whether a row is unique and, if not, gives me its index? How would you build this DB? Thanks!
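For reference, something along these lines is what I'm hoping for: load the whole file at once and have numpy report the unique rows together with their counts. I'm not sure this is the right function or the right arguments, so treat it as a sketch:

import numpy as np

# Load the whole file into an (N, 20) array; assumes whitespace-separated numbers
# and a placeholder file name.
data = np.loadtxt("data.txt")

# unique_rows: one row per distinct line in the file,
# counts[i]: how many times unique_rows[i] appeared.
unique_rows, counts = np.unique(data, axis=0, return_counts=True)

print(unique_rows.shape, counts)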