I have roughly 1 million rows and 200 columns; two sample rows are given below.

a=[0,'aaa', 'bbb', 'ccc',.........200]
b=[1,'aaa', 'ere', 'ccc',.........200]

I want to find the intersection of the two rows.
From what I have read, intersection with sets is very fast and cost-efficient.

But when I convert the above rows (lists) into sets, the elements lose their order.

For example

set(a) becomes {'aaa', 0, 'ccc', 'bbb',.........200}

Similarly, set(b) loses its order. I need the first element, i.e. the ID column, of each row being compared, but because the set jumbles the elements, I can no longer tell which element was first.

Is there any object that performs as well as a set for intersection and also lets me get the first element?

When the intersection takes place between a and b, and I do a[0] and b[0], I should get 0 and 1 respectively.

Below is what I want to achieve. I have many rows and columns, and I want to create a similarity matrix from the dataset. The dataset is given below (in reality it is a numpy array):

ID   AGE  Occupation Gender Product_range   Product
0   25-34   IT          M   40-50            laptop
1   18-24   Student     F   30-40            desktop
2   25-34   IT          M   40-50            laptop
3   35-44   Research    M   60-70            TV
4   35-44   Research    M   0-1              AC
5   25-34   Lawyer      F   5-6              utensils
6   45-54   Business    F   4-5              toaster

I want to create a similarity matrix out of it (in our case 7*7) where each matrix element is the similarity between two rows. As you can see, rows 0 and 2 are very similar, in fact identical except for the row number. The row number takes part in the intersection but never contributes to the outcome.

The piece of code I have written for calculating the similarity is given below

import itertools

data_set = [set(row) for row in data_train]
flattened_upper_triangle_of_the_matrix = []

columns = 5  # the ID doesn't participate
for row1, row2 in itertools.combinations(data_set, r=2):
    # Here I want to capture the row numbers, because I want to store
    # the row numbers of the two rows that are most similar.
    intersection = row1.intersection(row2)
    flattened_upper_triangle_of_the_matrix.append(len(intersection) / columns)

return flattened_upper_triangle_of_the_matrix
Charles Duffy
Sam
  • I don't quite understand what you're trying to accomplish. How will intersection find differing elements? – Cameron May 21 '15 at 15:56
  • What do you mean by intersection here? What would the intersection of `[0, 'aaa', ...]` and `[1, 'aaa', ...]` look like? – chepner May 21 '15 at 16:00
  • I want to find the common ones, but at the same time I want to record the IDs of the intersecting rows. The IDs are different for each row. – Sam May 21 '15 at 16:01
  • Give a small but complete example, please. Like, 5 rows and 5 columns. And the desired outcome. – Stefan Pochmann May 21 '15 at 16:02
  • Why does the ID have to be part of the set if it needs to be treated differently from all other elements? Can't you extract it earlier, store it separately...? – Thijs van Dien May 21 '15 at 16:13
  • I do not agree on this question being a duplicate. OP is not asking for the existence of sorted set but for a solution to his problem. He just implicitly *assumes* that having a sorted set *might* solve it, which imo is not the case. Although he should elaborate more on the expected result. – swenzel May 21 '15 at 16:16
  • Adding to my earlier comment, how would the IDs ever intersect at all? Aren't they unique? And when preserving order, what would the intersection of `[0, 'aaa', 'bbb', 1]` and `[1, 'bbb', 'aaa']` look like? – Thijs van Dien May 21 '15 at 16:21
  • Hi Thijs, since I am dealing with a larger dataset I am trying to skip loops as much as possible, to avoid run-time latency. – Sam May 21 '15 at 16:35
  • You could do something like `data = {row[0]: set(row[1:]) for row in [a, b]}`. And then loop through like `for row1, row2 in itertools.combinations(data, r=2):` – camz May 21 '15 at 17:08
  • Ya, this would give me row1 and row2 as 0 and 1, but to do the intersection I would have to do something like `data[0].intersection(data[1])`. Fetching a value by key from a dictionary costs O(1), so O(2) in total for one iteration; over n iterations the cost would add up to O(n). – Sam May 21 '15 at 17:35
  • If you want an n*n similarity matrix where n is a million... you're gonna have a bad time. Why do you want it? Surely it's just a means to some end, and we could perhaps help with a better means? – Stefan Pochmann May 21 '15 at 17:42
  • Ah sorry, I meant `for row1, row2 in itertools.combinations(data.iteritems(), r=2):` you would then look at `row1[0]` for the id and `row1[1]` for the set. (saves the lookup) – camz May 21 '15 at 17:52
  • Ya thanks, this would do, I presume. – Sam May 21 '15 at 17:54
  • The way you provide your sample data (`a=[0,'aaa', 'bbb', 'ccc',.........200]`), it almost looks like a flat list rather than tabular. Surely that's not right? – Charles Duffy May 21 '15 at 22:16
  • It's not a flat list either; I convert the tabular data into a numpy array using `data_train=data_train_cvt.reset_index().values` – Sam May 22 '15 at 05:54
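
The suggestion from the comments above (split the ID off before building the set, then iterate over pairs) can be sketched as follows. This is only an illustration under stated assumptions: `pairwise_similarity` is a hypothetical helper name, the sample rows are copied from the first four lines of the table in the question, and the code targets Python 3, so it uses a list of `(id, set)` tuples rather than `iteritems()`.

```python
import itertools

# Sample rows modelled on the table in the question: ID first, then attributes.
data_train = [
    [0, "25-34", "IT", "M", "40-50", "laptop"],
    [1, "18-24", "Student", "F", "30-40", "desktop"],
    [2, "25-34", "IT", "M", "40-50", "laptop"],
    [3, "35-44", "Research", "M", "60-70", "TV"],
]


def pairwise_similarity(rows):
    """Return (id1, id2, similarity) for every pair of rows.

    Hypothetical helper: the ID is split off before the set is built,
    so it stays available and never takes part in the intersection.
    """
    columns = len(rows[0]) - 1  # the ID column does not participate
    keyed = [(row[0], set(row[1:])) for row in rows]
    return [
        (id1, id2, len(attrs1 & attrs2) / columns)
        for (id1, attrs1), (id2, attrs2) in itertools.combinations(keyed, r=2)
    ]


similarities = pairwise_similarity(data_train)
best = max(similarities, key=lambda t: t[2])  # most similar pair with its IDs
```

Here `best` carries the two row IDs alongside the score, so no dictionary lookup is needed inside the loop. Note that, as in the question's code, a plain set compares values regardless of which column they came from, and the number of pairs still grows quadratically with the number of rows.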

1 Answer

You could try the ordered-set package: https://pypi.python.org/pypi/ordered-set

Possible duplicate of your question: Does Python have an ordered set?
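
If installing a package is not an option, an order-preserving intersection can also be approximated with the standard library alone. The following is a minimal sketch, not the ordered-set package's API: `ordered_intersection` is a hypothetical helper, and the sample rows are shortened versions of `a` and `b` from the question.

```python
def ordered_intersection(first, second):
    """Elements of `first` that also occur in `second`, in `first`'s order."""
    lookup = set(second)  # set membership tests are O(1) on average
    seen = set()          # guards against emitting duplicates
    result = []
    for item in first:
        if item in lookup and item not in seen:
            seen.add(item)
            result.append(item)
    return result


# Shortened versions of the sample rows from the question.
a = [0, "aaa", "bbb", "ccc"]
b = [1, "aaa", "ere", "ccc"]

common = ordered_intersection(a, b)
```

Because `a` and `b` stay ordinary lists, `a[0]` and `b[0]` still return the IDs 0 and 1, while `common` holds the shared values in the order they appear in `a`.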

Community