
I'm reading a large file where each row has 20 numbers. I want to end up with a 2D array in which each row is a unique row from the file, plus, for each row, a count of how many times it appeared in the file.

I did this by building rowDB - a list of lists (each sub-list is one 20-number row from the file) - and a parallel list that records how many times each row appeared:

[uniq, idx] = is_unique(rowDB, new_row)
if (uniq):
    rowDB.append(new_row)
    num_of_occurrences.append(1)
else:
    num_of_occurrences[idx] += 1

I created this helper function, which checks whether new_row is unique - i.e. does not already exist in rowDB. It returns uniq = True/False and, if False, also the index of the matching row in rowDB:

def is_unique(rowDB, new_row):
    # Linear scan: compare new_row against every stored row.
    # (Python lists compare element-wise, so the inner loop is unnecessary.)
    for i, row_i in enumerate(rowDB):
        if row_i == new_row:
            return [False, i]
    return [True, 0]

However, when the DB is large this takes a lot of time. So my question is: what is the most efficient way to do this? Maybe using a numpy array instead of lists? If so, is there a built-in numpy function to check whether a row is unique and, if not, to get the row's index? How would you build this DB? Thanks!!!

JonyK
  • This might solve all of it : http://stackoverflow.com/a/33235665/3293881 – Divakar Aug 30 '16 at 08:48
  • So you suggest to first build the full 2D array that will contain many duplicate rows, and then use the function from your answer to convert the full array to a unique array with the counts? wouldn't it be more efficient to build the DB in a unique manner from the first place? – JonyK Aug 30 '16 at 08:54
  • Well I am not sure how you can build the DB in a unique manner and also get the counts in the first place, because at least for the counts you would require all of the information? – Divakar Aug 30 '16 at 10:27
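A minimal sketch of the approach Divakar links to: load everything into one 2D array first, then collapse duplicates and get the counts in a single vectorized call. The small `data` array here is a stand-in for the parsed file, and `np.unique`'s `axis` keyword requires NumPy >= 1.13.

```python
import numpy as np

# Stand-in for the full 2D array read from the file
# (in practice each row would have 20 numbers).
data = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [1, 2, 3]])

# Collapse duplicate rows; counts[i] is how often unique_rows[i] appeared.
# Note: np.unique returns the rows in sorted order, not file order.
unique_rows, counts = np.unique(data, axis=0, return_counts=True)
```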

2 Answers


You could probably use collections.Counter for that; there are some good examples in the official documentation.
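A minimal sketch of the Counter idea, assuming each row is first converted to a hashable tuple (`rows` is a stand-in for the parsed file contents):

```python
from collections import Counter

# Stand-in for the rows read from the file.
rows = [
    [1, 2, 3],
    [4, 5, 6],
    [1, 2, 3],
]

# Lists are not hashable, so convert each row to a tuple before counting.
counts = Counter(tuple(row) for row in rows)

# Rebuild the question's two parallel structures from the Counter
# (Counter preserves first-seen order on Python 3.7+).
rowDB = [list(t) for t in counts]
num_of_occurrences = [counts[t] for t in counts]
```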

magne4000

You may use a tuple to store each line's data and build rowDB as an OrderedDict that maps each line tuple to its index. Then the uniqueness check becomes a simple, fast:

return new_row not in rowDB

is_uniq would be:

def is_uniq(rowDB, new_row):
    # Dict lookup is O(1) on average, vs. the O(n) scan over rowDB.
    if new_row in rowDB:
        return False, rowDB[new_row]
    else:
        return True, 0
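Putting the answer's idea together, the whole loop from the question could look like this sketch, where `lines` stands in for the row tuples read from the file:

```python
from collections import OrderedDict

# rowDB maps each row tuple to the index where it was first seen,
# so membership testing and index lookup are both O(1) on average.
rowDB = OrderedDict()
num_of_occurrences = []

lines = [(1, 2), (3, 4), (1, 2)]  # stand-in for rows read from the file
for row in lines:
    if row in rowDB:
        num_of_occurrences[rowDB[row]] += 1
    else:
        rowDB[row] = len(rowDB)
        num_of_occurrences.append(1)
```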
citaret