For each ID, print order of items from second column in new data frame (R or Python)

Question

I have a dataframe such as below, that is already sorted:

id   desc
1   GAG
1   GAG
1   AP
1   RT
1   RT
2   ABC
2   ABC
2   DEF
2   GHI
2   ABC

From this, I want to generate a second dataframe containing the ID in one column followed by a second column showing the order that "DESC" items appear for each ID. I also do not want duplicates IF they appear in order.

Desired output:

id   ORDER

1    GAG-AP-RT
2    ABC-DEF-GHI-ABC

If the same DESC appears twice for the same ID, but not next to each other, I would like it to be included.

I have tried iterating through each row in R, generating a temporary dataframe containing only entries linked to a single ID, however I have been unable to include each DESC once in the order they appear.

For my solution using R, I have "premade" an output table called "record", which contains one row for each ID. This has allowed me to get part way to a solution:

for (i in 1:length(record$id)) {
+   tmp <- input[input$id == record$id[i],]
+   record$ORDER[i] <- paste(tmp$desc, collapse = "-")
+ }

This resulted in:

id    ORDER
1    GAG-GAG-AP-RT-RT
2    ABC-ABC-DEF-GHI-ABC

Any help would be much appreciated!

UPDATE

I think it is nearly there, I now have:

for (i in 1:length(record$id)) {
+   tmp <- input[input$id == record$id[i],]
+   record$ORDER[i] <- paste(unique(tmp$desc), collapse = "-")
+ }

However this has removed occurrences of the same string in desc that are separated by a different string. I need to remove occurrence of the same string next to each other, but not when separated by a different one.

Questions should be focused on a single language and show your attempt in that language. — G. Grothendieck, Jun 15 '20 at 13:21
When I used unique() it removed second occurrences of the same string that were separated by a different string. Is there a way around this? — toby_b, Jun 15 '20 at 13:27
I think it would be easier to visit the link I duped your question with and use one of those methods — Sotos, Jun 15 '20 at 13:34

score 0 · Answer 1 · answered Jun 15 '20 at 13:31

Here's one way to do this:

import pandas as pd
from io import StringIO


df = pd.read_csv(StringIO("""
id   desc
1   GAG
1   GAG
1   AP
1   RT
1   RT
2   ABC
2   ABC
2   DEF
2   GHI
2   ABC
"""
), sep='\s+')

def merge_desc(descs):
    result = []
    for desc in descs:
        if not result or result[-1] != desc:
            result.append(desc)
    return '-'.join(result)

df2 = df.groupby('id')['desc'].apply(merge_desc).reset_index()
df2.rename(columns={'desc': 'ORDER'}, inplace=True)
print(df2)

Output

   id            ORDER
0   1        GAG-AP-RT
1   2  ABC-DEF-GHI-ABC

For each ID, print order of items from second column in new data frame (R or Python)

UPDATE

1 Answers1