Python Pandas, All occurrences of text after multiple of the same delimiter

Question

I would like to pull the numbers after the text "[Unique ID] " until the next space. I would then like to create a new column that will display each unique UID that has been pulled.

I am able to grab the first occurrence, but am not able to grab all occurrences.

Here is the code I have been using:

    from pandas import DataFrame

    Info = {'ID': ['1','2','3'],
        'Name': ['Tom Johnson', 'Ben Thompson', 'Mike'],
        'Information': ["[Age] 22 [Height] 6'2 [Unique ID] 1424 [Unique ID] 1438 [Unique ID] 1439","[Age] 21 [Height] 6'0 [Unique ID] 1101 [Unique ID] 1101","[Age] 20 [Height] 6'3 [Unique ID] 1424 [Unique ID] 1498"]}

    df = DataFrame(Info,columns= ['ID', 'Name', 'Information'])


    df['UID'] = [df.split("[Unique ID] ")[1].split(" ")[0] for df in df['Information']]

As you can see it will only grab the first occurrence of after "[Unique ID] ". However, I would like all occurrences.

The desired output would be,

    Info2 = {'ID': ['1','1','1','2','3','3'],
        'Name': ['Tom Johnson', 'Tom Johnson', 'Tom Johnson', 'Ben Thompson', 'Mike', 'Mike'],
        'Information': ["[Age] 22 [Height] 6'2 [Unique ID] 1424 [Unique ID] 1438 [Unique ID] 1439",
                        "[Age] 22 [Height] 6'2 [Unique ID] 1424 [Unique ID] 1438 [Unique ID] 1439",
                        "[Age] 22 [Height] 6'2 [Unique ID] 1424 [Unique ID] 1438 [Unique ID] 1439",
                        "[Age] 21 [Height] 6'0 [Unique ID] 1101 [Unique ID] 1101",
                        "[Age] 20 [Height] 6'3 [Unique ID] 1424 [Unique ID] 1498",
                        "[Age] 20 [Height] 6'3 [Unique ID] 1424 [Unique ID] 1498"],
        'UID': ['1424','1438', '1439', '1101', '1424', '1498']}

    df2 = DataFrame(Info2,columns= ['ID','Name', 'Information', 'UID'])

As shown, it has each unique UID for each record, and if the record contains multiple of the same UID's it does not create a new record.

Thanks!

The logic is not clear. There are multiple `Unique ID` each row, so I'd say they are not unique. Which one do you want to pick? In your expected result you have only one number, but sometimes is the first, sometimes the second, and so on — Valentino, Jun 25 '19 at 13:00

score 1 · Answer 1 · answered Jun 25 '19 at 13:01

You can use str.split and `melt:

new_df = pd.concat((df[['Name', 'Information']],
                    df.Information.str.split('\[Unique ID\]', expand=True)),
               axis=1)
new_df.drop(0, axis=1, inplace=True)

(new_df.melt(id_vars=['Name', 'Information'], 
            value_name='UID')
       .drop('variable', axis=1)
       .dropna()
)

Output

           Name                                        Information     UID
0   Tom Johnson  [Age] 22 [Height] 6'2 [Unique ID] 1424 [Unique...   1424 
1  Ben Thompson  [Age] 21 [Height] 6'0 [Unique ID] 1101 [Unique...   1101 
2          Mike  [Age] 20 [Height] 6'3 [Unique ID] 1424 [Unique...   1424 
3   Tom Johnson  [Age] 22 [Height] 6'2 [Unique ID] 1424 [Unique...   1438 
4  Ben Thompson  [Age] 21 [Height] 6'0 [Unique ID] 1101 [Unique...    1101
5          Mike  [Age] 20 [Height] 6'3 [Unique ID] 1424 [Unique...    1498
6   Tom Johnson  [Age] 22 [Height] 6'2 [Unique ID] 1424 [Unique...    1439

Rakesh · Accepted Answer · 2019-06-25T13:26:31.083

Using str.findall

Ex:

from pandas import DataFrame

Info = {'ID': ['1','2','3'],
    'Name': ['Tom Johnson', 'Ben Thompson', 'Mike'],
    'Information': ["[Age] 22 [Height] 6'2 [Unique ID] 1424 [Unique ID] 1438 [Unique ID] 1439","[Age] 21 [Height] 6'0 [Unique ID] 1101 [Unique ID] 1101","[Age] 20 [Height] 6'3 [Unique ID] 1424 [Unique ID] 1498"]}

df = DataFrame(Info,columns= ['ID', 'Name', 'Information'])
df['UID'] = df["Information"].str.findall(r"\[Unique ID\]\s*(\d+)")
#Ref https://stackoverflow.com/a/48532692/532312
lst_col = 'UID'
df = pd.DataFrame({
      col:np.repeat(df[col].values, df[lst_col].str.len())
      for col in df.columns.drop(lst_col)}
    ).assign(**{lst_col:np.concatenate(df[lst_col].values)})[df.columns]
print(df)

Output:

  ID          Name                                        Information   UID
0  1   Tom Johnson  [Age] 22 [Height] 6'2 [Unique ID] 1424 [Unique...  1424
0  1   Tom Johnson  [Age] 22 [Height] 6'2 [Unique ID] 1424 [Unique...  1438
0  1   Tom Johnson  [Age] 22 [Height] 6'2 [Unique ID] 1424 [Unique...  1439
1  2  Ben Thompson  [Age] 21 [Height] 6'0 [Unique ID] 1101 [Unique...  1101
1  2  Ben Thompson  [Age] 21 [Height] 6'0 [Unique ID] 1101 [Unique...  1101
2  3          Mike  [Age] 20 [Height] 6'3 [Unique ID] 1424 [Unique...  1424
2  3          Mike  [Age] 20 [Height] 6'3 [Unique ID] 1424 [Unique...  1498

Thanks. That is why I added the Ref to the logic so that OP can choose what they want. — Rakesh, Jun 25 '19 at 13:27

Python Pandas, All occurrences of text after multiple of the same delimiter

2 Answers2