Given the following inputs:

In [18]: input
Out[18]:
   1  2   3  4
0  1  5   9  1
1  2  6  10  2
2  1  5   9  1
3  1  5   9  1

In [26]: df = input.drop_duplicates()

In [27]: df
Out[27]:
   1  2   3  4
0  1  5   9  1
1  2  6  10  2

How would I go about getting an array that, for each row in input, holds the index of the equivalent row in the subset, e.g.:

resultant = [0, 1, 0, 0] 

I.e. the '1' here states that (row[1] in input) == (row[1] in df). Since there are fewer unique rows than original rows, multiple values in 'resultant' will map to the same row in df, i.e. (row[k] in input) == (row[k+N] in input) == (row[1] in df) could be a case.

I am looking for the actual row-number mapping from input to df.

While this example is trivial, in my case I have a ton of dropped rows that might all map to a single index.
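
In other words, the mapping should satisfy the following invariant (a quick sanity check, rebuilding the frames above):

import pandas as pd

input = pd.DataFrame([[1, 5, 9, 1], [2, 6, 10, 2], [1, 5, 9, 1], [1, 5, 9, 1]],
                     columns=[1, 2, 3, 4])
df = input.drop_duplicates()

resultant = [0, 1, 0, 0]
# each row of input must equal the df row its resultant entry points at
for i, j in zip(input.index, resultant):
    assert (input.loc[i] == df.loc[j]).all()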

Why do I want this? I am training an autoencoder-type system where the target sequence is non-unique.

bge0
  • I am confused. Do you mean the indices of the duplicate rows dropped? – Alex Mar 12 '15 at 21:16
  • The indices within the dropped dataframe where the rows are equal to the rows in the input, i.e. row 0 in df is equal to row 0 in input. – bge0 Mar 12 '15 at 21:17
  • Updated question for clarity – bge0 Mar 12 '15 at 21:20
  • Added more info into the original question. Does this help? I don't believe that will work because there can be multiple duplicates. – bge0 Mar 12 '15 at 22:03
  • Oh, I got it. The 0/1 threw me off. I thought you meant it as boolean but it's just the index number... – JohnE Mar 12 '15 at 22:35

2 Answers

One way would be to treat it as a groupby on all columns of the original frame:

>>> input.groupby(list(input.columns)).groups
{(1, 5, 9, 1): [0, 2, 3], (2, 6, 10, 2): [1]}
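
For instance, to turn that dict into the resultant array, you could point every original index at the first member of its group (a minimal sketch, rebuilding the example frame; the helper dict first is mine):

import pandas as pd

input = pd.DataFrame([[1, 5, 9, 1], [2, 6, 10, 2], [1, 5, 9, 1], [1, 5, 9, 1]],
                     columns=[1, 2, 3, 4])

groups = input.groupby(list(input.columns)).groups
# point each original row at the first (kept) row of its duplicate group
first = {i: members[0] for members in groups.values() for i in members}
resultant = [first[i] for i in input.index]
# resultant == [0, 1, 0, 0]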

Another would be to sort and then compare, which is less efficient in theory but could very well be faster in some cases and is definitely easier to make more tolerant of error:

>>> ds = input.sort_values(list(input.columns))
>>> eqs = (ds != ds.shift()).any(axis=1).cumsum()
>>> ds.index.groupby(eqs)
{1: [0, 2, 3], 2: [1]}

This seems like the right data structure to me, but if you really do want an array with the group ids, that's easy too, e.g.

>>> eqs.sort_index() - 1
0    0
1    1
2    0
3    0
dtype: int64
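
And if you want the row labels that drop_duplicates keeps rather than group ids, one way (a sketch continuing from the session above; the names are mine) is to map each group id back to its first-occurring label:

ids = eqs.sort_index()            # group id per row, in original row order
keepers = ids.drop_duplicates()   # first row label seen for each group id
id_to_label = {gid: label for label, gid in keepers.items()}
ids.map(id_to_label).tolist()     # [0, 1, 0, 0]

Here that coincides with eqs.sort_index() - 1, but it also works when the kept labels aren't 0, 1, 2, ….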
DSM

I don't have pandas installed on this computer, but I think you could use df.iterrows(), like:

def find_matching_row(row, df_slimmed):
    # scan the slimmed frame for a row equal to `row`, restricted to the
    # columns that survived the slimming
    for index, slimmed_row in df_slimmed.iterrows():
        if slimmed_row.equals(row[slimmed_row.index]):
            return index

def rows_mappings(df, df_slimmed):
    # for each row of the full frame, yield the index of its match in the slimmed one
    for _, row in df.iterrows():
        yield find_matching_row(row, df_slimmed)

list(rows_mappings(input, df))

This generates the resultant list in your example; I don't quite follow the latter part of your reasoning.
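
If the row-by-row scan is too slow, a vectorized alternative (my own sketch, not part of the original answer) is to tag the deduplicated rows with their labels and merge on all columns; a left merge preserves input's row order:

import pandas as pd

input = pd.DataFrame([[1, 5, 9, 1], [2, 6, 10, 2], [1, 5, 9, 1], [1, 5, 9, 1]],
                     columns=[1, 2, 3, 4])
df = input.drop_duplicates()

# carry df's row labels along as an ordinary column, then join on every column
tagged = df.reset_index().rename(columns={'index': 'df_row'})
resultant = input.merge(tagged, on=list(input.columns), how='left')['df_row'].tolist()
# resultant == [0, 1, 0, 0]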

deinonychusaur
  • This definitely works (just needed to change == to .equals()). Is there a more optimal way of doing this, though? – bge0 Mar 12 '15 at 22:11
  • The bug should now be fixed, and it allows for column dropping. For optimal, if possible, follow the trail at http://stackoverflow.com/questions/10729210/iterating-row-by-row-through-a-pandas-dataframe – deinonychusaur Mar 12 '15 at 22:17