Given the following inputs:

In [18]: input
Out[18]:
   1  2   3  4
0  1  5   9  1
1  2  6  10  2
2  1  5   9  1
3  1  5   9  1

In [26]: df = input.drop_duplicates()

In [27]: df
Out[27]:
   1  2   3  4
0  1  5   9  1
1  2  6  10  2

How would I go about getting an array that, for each row in input, holds the index of the equivalent row in the subset, e.g.:

resultant = [0, 1, 0, 0] 

I.e. the '1' here states that (row[1] in input) == (row[1] in df). Since there are fewer unique rows than original rows, multiple values in 'resultant' will map to the same row in df, i.e. (row[k] in input) == (row[k+N] in input) == (row[1] in df) could be a case.

I am looking for the actual row-number mapping from input to df.

While this example is trivial, in my case I have a ton of dropped rows that might all map to a single index.
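
In other words, the mapping should satisfy the following invariant (a quick sanity check, rebuilding the frames above):

import pandas as pd

input = pd.DataFrame([[1, 5, 9, 1], [2, 6, 10, 2], [1, 5, 9, 1], [1, 5, 9, 1]],
                     columns=[1, 2, 3, 4])
df = input.drop_duplicates()

resultant = [0, 1, 0, 0]
# each row of input must equal the df row its resultant entry points at
for i, j in zip(input.index, resultant):
    assert (input.loc[i] == df.loc[j]).all()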

Why do I want this? I am training an autoencoder-type system where the target sequence is non-unique.

bge0
  • I am confused. Do you mean the indices of the duplicate rows dropped? – Alex Mar 12 '15 at 21:16
  • The indices within the dropped dataframe where the rows are equal to the rows in the input, i.e. row 0 in df is equal to row 0 in input. – bge0 Mar 12 '15 at 21:17
  • Updated question for clarity – bge0 Mar 12 '15 at 21:20
  • Added more info into the original question. Does this help? I don't believe that will work because there can be multiple duplicates. – bge0 Mar 12 '15 at 22:03
  • Oh, I got it. The 0/1 threw me off. I thought you meant it as boolean but it's just the index number... – JohnE Mar 12 '15 at 22:35

2 Answers

One way would be to treat it as a groupby on all columns of the original frame:

>>> input.groupby(list(input.columns)).groups
{(1, 5, 9, 1): [0, 2, 3], (2, 6, 10, 2): [1]}
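
For instance, to turn that dict into the resultant array, you could point every original index at the first member of its group (a minimal sketch, rebuilding the example frame; the helper dict first is mine):

import pandas as pd

input = pd.DataFrame([[1, 5, 9, 1], [2, 6, 10, 2], [1, 5, 9, 1], [1, 5, 9, 1]],
                     columns=[1, 2, 3, 4])

groups = input.groupby(list(input.columns)).groups
# point each original row at the first (kept) row of its duplicate group
first = {i: members[0] for members in groups.values() for i in members}
resultant = [first[i] for i in input.index]
# resultant == [0, 1, 0, 0]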

Another would be to sort and then compare, which is less efficient in theory but could very well be faster in some cases and is definitely easier to make more tolerant of error:

>>> ds = input.sort_values(list(input.columns))
>>> eqs = (ds != ds.shift()).any(axis=1).cumsum()
>>> ds.index.groupby(eqs)
{1: [0, 2, 3], 2: [1]}

This seems like the right data structure to me, but if you really do want an array with the group ids, that's easy too, e.g.

>>> eqs.sort_index() - 1
0    0
1    1
2    0
3    0
dtype: int64
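
And if you want the row labels that drop_duplicates keeps rather than group ids, one way (a sketch continuing from the session above; the names are mine) is to map each group id back to its first-occurring label:

ids = eqs.sort_index()            # group id per row, in original row order
keepers = ids.drop_duplicates()   # first row label seen for each group id
id_to_label = {gid: label for label, gid in keepers.items()}
ids.map(id_to_label).tolist()     # [0, 1, 0, 0]

Here that coincides with eqs.sort_index() - 1, but it also works when the kept labels aren't 0, 1, 2, ….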
DSM

I don't have pandas installed on this computer, but I think you could use df.iterrows(), like:

def find_matching_row(row, df_slimmed):
    # scan the slimmed frame for a row equal to `row`, restricted to the
    # columns that survived the slimming
    for index, slimmed_row in df_slimmed.iterrows():
        if slimmed_row.equals(row[slimmed_row.index]):
            return index

def rows_mappings(df, df_slimmed):
    # for each row of the full frame, yield the index of its match in the slimmed one
    for _, row in df.iterrows():
        yield find_matching_row(row, df_slimmed)

list(rows_mappings(input, df))

This generates the resultant list in your example; I don't quite follow the latter part of your reasoning.
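
If the row-by-row scan is too slow, a vectorized alternative (my own sketch, not part of the original answer) is to tag the deduplicated rows with their labels and merge on all columns; a left merge preserves input's row order:

import pandas as pd

input = pd.DataFrame([[1, 5, 9, 1], [2, 6, 10, 2], [1, 5, 9, 1], [1, 5, 9, 1]],
                     columns=[1, 2, 3, 4])
df = input.drop_duplicates()

# carry df's row labels along as an ordinary column, then join on every column
tagged = df.reset_index().rename(columns={'index': 'df_row'})
resultant = input.merge(tagged, on=list(input.columns), how='left')['df_row'].tolist()
# resultant == [0, 1, 0, 0]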

deinonychusaur
  • This definitely works (just needed to change == to .equals()). Is there a more optimal way of doing this, though? – bge0 Mar 12 '15 at 22:11
  • The bug should now be fixed, and it allows for column dropping. For optimal, if possible, follow the trail at http://stackoverflow.com/questions/10729210/iterating-row-by-row-through-a-pandas-dataframe – deinonychusaur Mar 12 '15 at 22:17