I have roughly 1 million rows and 200 columns; two sample rows are given below.
a=[0,'aaa', 'bbb', 'ccc',.........200]
b=[1,'aaa', 'ere', 'ccc',.........200]
I want to find the intersection of the two rows.
From what I have read, intersection with sets is very fast and cost-efficient.
But when I convert the above rows (lists) into sets, the elements lose their original order.
For example
set(a) becomes {'aaa', 0, 'ccc', 'bbb',.........200}
Similarly, set(b) loses its order. I need the first element (the ID column) of each row being compared, but because the set jumbles the elements, I cannot reliably get the first element of a row.
Is there an object that performs as well as a set for intersection and would also let me get the first element?
When the intersection takes place between a and b and I evaluate a[0] and b[0], I should get 0 and 1 respectively.
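One idea I am considering, shown here only as a minimal sketch: keep the ID outside the set and build the set from the remaining columns, so the ID can never be jumbled. The helper name `split_row` and the shortened four-element rows are hypothetical, used for illustration only; the real rows have about 200 columns.

```python
# Hypothetical shortened rows; the real rows have about 200 columns.
a = [0, 'aaa', 'bbb', 'ccc']
b = [1, 'aaa', 'ere', 'ccc']

def split_row(row):
    """Return (row_id, frozenset of the remaining values) for one row."""
    return row[0], frozenset(row[1:])

id_a, values_a = split_row(a)
id_b, values_b = split_row(b)

# The intersection only involves the non-ID columns.
common = values_a & values_b
print(id_a, id_b, sorted(common))  # 0 1 ['aaa', 'ccc']
```

The IDs stay available as `id_a` and `id_b`, while the intersection still runs on sets.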
Below is what I want to achieve. I have many rows and columns, and I want to create a similarity matrix from the dataset. A sample of the dataset is given below (in practice it is a NumPy array):
ID  AGE    Occupation  Gender  Product_range  Product
0   25-34  IT          M       40-50          laptop
1   18-24  Student     F       30-40          desktop
2   25-34  IT          M       40-50          laptop
3   35-44  Research    M       60-70          TV
4   35-44  Research    M       0-1            AC
5   25-34  Lawyer      F       5-6            utensils
6   45-54  Business    F       4-5            toaster
I want to create a similarity matrix out of it (in our case 7*7, one row and one column per record), where each matrix element is the similarity between two rows. Rows 0 and 2, for example, are identical except for the row number. The row number takes part in the intersection but never contributes to the result, because it is unique to each row.
The piece of code I have written for calculating the similarity is given below
import itertools

def compute_similarity(data_train):
    data_set = [set(row) for row in data_train]
    flattened_upper_triangle_of_the_matrix = []
    columns = 5  # the ID column doesn't participate
    for row1, row2 in itertools.combinations(data_set, r=2):
        # Here I want to capture the row numbers of row1 and row2,
        # because I want to store the IDs of the two rows that are most similar.
        intersection = row1.intersection(row2)
        flattened_upper_triangle_of_the_matrix.append(len(intersection) / columns)
    return flattened_upper_triangle_of_the_matrix
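For reference, here is a rough sketch of the result I am after, assuming the rows are plain Python lists with the ID in the first position. The function name `pairwise_similarity` and the (id1, id2, similarity) output format are my own assumptions, not an established API. The sample rows are copied from the table above.

```python
import itertools

# Sample rows copied from the table above; the first field is the ID.
data_train = [
    [0, '25-34', 'IT', 'M', '40-50', 'laptop'],
    [1, '18-24', 'Student', 'F', '30-40', 'desktop'],
    [2, '25-34', 'IT', 'M', '40-50', 'laptop'],
    [3, '35-44', 'Research', 'M', '60-70', 'TV'],
    [4, '35-44', 'Research', 'M', '0-1', 'AC'],
    [5, '25-34', 'Lawyer', 'F', '5-6', 'utensils'],
    [6, '45-54', 'Business', 'F', '4-5', 'toaster'],
]

def pairwise_similarity(rows):
    """Return a list of (id1, id2, similarity) for every pair of rows."""
    columns = len(rows[0]) - 1  # the ID column does not participate
    # Pair each ID with a frozenset of the remaining values.
    keyed = [(row[0], frozenset(row[1:])) for row in rows]
    result = []
    for (id1, set1), (id2, set2) in itertools.combinations(keyed, r=2):
        result.append((id1, id2, len(set1 & set2) / columns))
    return result

pairs = pairwise_similarity(data_train)
best = max(pairs, key=lambda t: t[2])
print(best)  # (0, 2, 1.0)
```

One caveat with any set-based comparison: it ignores column position, so the same value appearing in two different columns would count as a match, and repeated values within a row collapse into one.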