0

I have a dataset where each row is a sample, and a column (name "Sample_ID") names each sample (df1 below). Some samples are repeated multiple times (i.e. have identical values for "Sample_ID"). I would like to generate a new column with different names for each sample (I'll call it "Sample_code") based on a simple ascending pattern (e.g. SAMP001, SAMP002, SAMP003 etc) from the first row to the last row in the table. But rows with identical Sample_IDs need to have identical Sample_code values as well (so I can't simply generate an ascending set of sample names for the new column).

In the example data below, df1 represents my starting data. df2 is what I want to end up with: the Sample_code column values ascend as you go down each row, but with the same value for the rows where Sample_ID is duplicated.

I'm quite puzzled where to start so any help would be much appreciated, thank you.

import numpy as np
import pandas as pd

# df1
data1 = {'Sample_ID': ['123123','123456','123123','123789','456789', '123654'], 
    'Variable_A': [15,12,7,19,3,12],
    'Variable_B':["blue","red","red","blue","blue", "red"]}
df1 = pd.DataFrame(data1)


# df2
data2 = {'Sample_ID': ['123123','123456','123123','123789','456789', '123654'],
     'Sample_code' : ['SAMP001', 'SAMP002', 'SAMP001', 'SAMP003', 'SAMP004', 'SAMP005'],
    'Variable_A': [15,12,7,19,3,12],
    'Variable_B':["blue","red","red","blue","blue", "red"]}
df2 = pd.DataFrame(data2)

df1
df2

EDIT Ideally I would like to have the ascending Sample_code names be in the original order of the rows, as the rows in the starting dataset are ordered by date of collection. I'd like the Sample_code names to be based on the first time a particular sample appears as you go down the rows. A new illustrative df3 has the date column to give a sense of what I mean.

# df3
data3 = {'Sample_ID': ['123123','123456','123123','123789','456789', 
'123654', '123123', '123789'], 
        'Date' : ['15/06/2019', '23/06/2019', '30/06/2019', '07/07/2019',
                  '15/07/2019', '31/07/2019', '12/08/2019', '27/08/2019'],
        'Variable_A': [15,12,7,19,3,12,7,9],
        'Variable_B':["blue","red","red","blue","blue", "red","blue", "red"]}
df3 = pd.DataFrame(data3)
df3

The solution suggested below works, but it creates Sample_code names based on the final row in which the repeated Sample_ID values appear, e.g. Sample_ID "123123" is labelled "SAMP006" (for the final row this value appears), but I'd like this one to be "SAMP001" (the first row in which it appears).

lookup = {}
for i, sample_name in enumerate(df3.Sample_ID):
    lookup[sample_name] = f'SAMP{i:03}'

df3['Sample_code'] = df3.Sample_ID.apply(lambda x: lookup[x])
df3
Will Hamilton
  • 357
  • 2
  • 17

2 Answers2

2

Use groupby to get each group number and then apply string formatting, eg:

df1['Sample_code'] = df1.groupby('Sample_ID').ngroup().add(1).apply('SAMP{:03}'.format)
Jon Clements
  • 138,671
  • 33
  • 247
  • 280
  • Thanks. While this does work, it has the effect of ordering the ascending Sample_code names by the Sample_ID value (which is an arbitrary string of numbers). The rows are already ordered by date so I'd prefer to keep their order for the Sample_code names. – Will Hamilton Mar 27 '20 at 08:03
  • @WillHamilton you should be able to just add `sort=False` to the `.groupby(...)` – Jon Clements Mar 27 '20 at 11:17
  • Adding sort=False to this line worked perfectly. Thank you. – Will Hamilton Mar 29 '20 at 22:27
1

You can create a lookup table by iterating over the unique values and then apply it to a new column:

lookup = {}
for i, sample_name in enumerate(df.Sample_ID.unique()):
    lookup[sample_name] = f'SAMP{i:03}'

df['Sample_code'] = df.Sample_ID.apply(lambda x: lookup[x])
Philip Ciunkiewicz
  • 2,652
  • 3
  • 12
  • 24
  • Thanks. Please see my edit - while your solution does work, ideally I would like the Sample_code values to be based on the first row in which a repeated Sample_ID value appears, rather than the final row. Is there a way of doing this? – Will Hamilton Mar 27 '20 at 08:21
  • @WillHamilton Oops! I neglected to add `.unique()` so it was iterating through the full DataFrame! I even said " iterating over the unique values " in my blurb haha. I have included the edits. – Philip Ciunkiewicz Mar 27 '20 at 19:08
  • Yep this works perfectly thanks. I just tweaked to make the numbers start from 1 rather than 0 for Sample_code with the lookup line (lookup[sample_name] = f'SAMP{i+1:03}'). – Will Hamilton Mar 29 '20 at 22:24
  • Consider dictionary comprehension: `lookup = {sample_name: f'SAMP{i:03}' for i, sample_name in enumerate(df.Sample_ID.unique())}`. Also, [use `map`](https://stackoverflow.com/q/49259580/1422451) instead of `apply`: `df['Sample_ID'].map(lookup)`. – Parfait Mar 29 '20 at 22:42