0

I have a dataframe such as below, that is already sorted:

id   desc
1   GAG
1   GAG
1   AP
1   RT
1   RT
2   ABC
2   ABC
2   DEF
2   GHI
2   ABC

From this, I want to generate a second dataframe containing the ID in one column followed by a second column showing the order that "DESC" items appear for each ID. I also do not want duplicates IF they appear in order.

Desired output:

id   ORDER

1    GAG-AP-RT
2    ABC-DEF-GHI-ABC

If the same DESC appears twice for the same ID, but not next to each other, I would like it to be included.

I have tried iterating through each row in R, generating a temporary dataframe containing only entries linked to a single ID, however I have been unable to include each DESC once in the order they appear.

For my solution using R, I have "premade" an output table called "record", which contains one row for each ID. This has allowed me to get part way to a solution:

for (i in 1:length(record$id)) {
+   tmp <- input[input$id == record$id[i],]
+   record$ORDER[i] <- paste(tmp$desc, collapse = "-")
+ }

This resulted in:

id    ORDER
1    GAG-GAG-AP-RT-RT
2    ABC-ABC-DEF-GHI-ABC

Any help would be much appreciated!

UPDATE

I think it is nearly there, I now have:

for (i in 1:length(record$id)) {
+   tmp <- input[input$id == record$id[i],]
+   record$ORDER[i] <- paste(unique(tmp$desc), collapse = "-")
+ }

However this has removed occurrences of the same string in desc that are separated by a different string. I need to remove occurrence of the same string next to each other, but not when separated by a different one.

toby_b
  • 1
  • 3

1 Answers1

0

Here's one way to do this:

import pandas as pd
from io import StringIO


df = pd.read_csv(StringIO("""
id   desc
1   GAG
1   GAG
1   AP
1   RT
1   RT
2   ABC
2   ABC
2   DEF
2   GHI
2   ABC
"""
), sep='\s+')

def merge_desc(descs):
    result = []
    for desc in descs:
        if not result or result[-1] != desc:
            result.append(desc)
    return '-'.join(result)

df2 = df.groupby('id')['desc'].apply(merge_desc).reset_index()
df2.rename(columns={'desc': 'ORDER'}, inplace=True)
print(df2)

Output

   id            ORDER
0   1        GAG-AP-RT
1   2  ABC-DEF-GHI-ABC
Balaji Ambresh
  • 4,977
  • 2
  • 5
  • 17