
In this pandas dataframe:

df =

pos    index  data
21      36    a,b,c
21      36    a,b,c
23      36    c,d,e
25      36    f,g,h
27      36    g,h,k
29      39    a,b,c
29      39    a,b,c
31      39    .
35      39    c,k
36      41    g,h
38      41    k,l
39      41    j,k
39      41    j,k

I want to remove repeated lines, but only within the same index group and only when they are in the head region of that sub-frame.

So, I did:

 df_grouped = df.groupby(['index'], as_index=True)

now,

 for i, sub_frame in df_grouped:
    sub_frame.apply(lambda g: ...)  # remove one duplicate line in the head region if the pos value is a repeat

I want to use an approach like this because some pos values are also repeated in the tail region, and those should not be removed.

Any suggestions?

Expected output:

 pos    index  data
removed
21      36    a,b,c
23      36    c,d,e
25      36    f,g,h
27      36    g,h,k
removed
29      39    a,b,c
31      39    .
35      39    c,k
36      41    g,h
38      41    k,l
39      41    j,k
39      41    j,k
everestial007
  • What about `df.drop_duplicates()` as in http://stackoverflow.com/questions/23667369/drop-all-duplicate-rows-in-python-pandas ? – Craig Mar 20 '17 at 01:52
  • A simple `drop` function would have worked, but I want to drop it only when the repeat is in the head region of the `sub_frame` (grouped by index values). That's the main problem. – everestial007 Mar 20 '17 at 01:54
  • @Craig: I just looked over the example and it won't work. Rather than specifying columns, I have to specify rows in each `subframe` after doing groupby (there might be other methods, though). Also, only one duplicate needs to be dropped, not both, and only in the head region (top two lines) of the subframe. – everestial007 Mar 20 '17 at 01:57
  • please let me know if either of the answers works for you. – Craig Mar 20 '17 at 02:37
  • I think they do in this context. But I am trying to take this answer and fit it into another big script, so I don't want to run into an issue. Btw, I think there is another way to remove the row I want to. Let me post that question. – everestial007 Mar 20 '17 at 02:40
  • @Craig: If you have a chance, can you please look into this question: http://stackoverflow.com/questions/42895061/how-to-remove-a-row-from-pandas-dataframe-based-on-the-length-of-the-column-valu – everestial007 Mar 20 '17 at 02:47

1 Answer


If it doesn't have to be done in a single apply statement, then this code will only remove duplicates in the head region:

import pandas as pd

data = {'pos':  [21, 21, 23, 25, 27, 29, 29, 31, 35, 36, 38, 39, 39],
        'idx':  [36, 36, 36, 36, 36, 39, 39, 39, 39, 41, 41, 41, 41],
        'data': ['a,b,c', 'a,b,c', 'c,d,e', 'f,g,h', 'g,h,k', 'a,b,c', 'a,b,c', '.', 'c,k', 'g,h', 'k,l', 'j,k', 'j,k']}

df = pd.DataFrame(data)

accum = []
for i, sub_frame in df.groupby('idx'):
    # de-duplicate only the head region (first two rows), keep the tail untouched
    accum.append(pd.concat([sub_frame.iloc[:2].drop_duplicates(), sub_frame.iloc[2:]]))

df2 = pd.concat(accum)

print(df2)
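
For reference, the same head-region de-duplication can also be written as a single `groupby().apply()`, which is closer to what the question sketches. This is only a minimal sketch, assuming the `idx` column name used above and relying on `group_keys=False` to keep the original row index:

def drop_head_dupes(g):
    # de-duplicate only the first two rows, then re-attach the untouched tail
    return pd.concat([g.iloc[:2].drop_duplicates(), g.iloc[2:]])

df2 = df.groupby('idx', group_keys=False).apply(drop_head_dupes)
print(df2)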

EDIT2: The first version of the chained command that I posted was wrong and only worked for the sample data. This version provides a more general solution to remove duplicate rows per the OP's request:

df.drop(df.groupby('idx')         # group by the idx column
          .head(2)                # first two rows of each group
          .duplicated()           # Series that is True for duplicated head rows
          .to_frame(name='duped') # turn the Series into a DataFrame
          .query('duped')         # keep only the duplicated rows
          .index)                 # index labels of the rows to drop
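
If you prefer boolean indexing over `query`, the same row labels can be collected in two steps; this is a sketch under the same assumptions (column named `idx`, default integer row labels):

head2 = df.groupby('idx').head(2)             # first two rows of every idx group
dup_labels = head2[head2.duplicated()].index  # labels of the duplicated head rows
df2 = df.drop(dup_labels)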
Craig