-1

I received data in a csv. from a report pulled from Salesforce. Each row represents a transaction and the person that worked on the transaction. There are multiple cases where more than 1 person worked on the same transaction so in those cases there are duplicate rows that show the date, time stamp, location, office, etc. except for the person's name.

I want each row to represent one transaction and in the case where multiple people worked on the same transaction, a new column would be created to capture the other person's name.

This is what the data looks like:

Trans_ID | Name
================
1        | Dom Toretto
2        | Brian Oconnor
3        | Letty Garcia
3        | Mia Toretto
4        | Brian Oconnor
4        | Dom Toretto
4        | Letty Garcia
4        | Jesse

This is what I want:

Trans_ID | Name          | Name2        | Name3        | Name4
===============================================================
1        | Dom Toretto   | Letty Garcia |              |
2        | Brian Oconnor |              |              |
3        | Letty Garcia  | Mia Toretto  |              |
4        | Brian Oconnor | Dom Toretto  | Letty Garcia | Jesse

I want to do this in R, but if it is easier in SQL or Excel, that would work too.

Thank you!

EDIT So I tried using case when and lag in r to do this but now I have a new problem of deleting the rows that are no longer necessary. This may also be the wrong approach. Any ideas?

df<- data.frame(trans_id = c(1,2,3,3,4,4,4,4), 
Name = c("Dom Toretto","Brian Oconnor","Letty Garcia","Mia Toretto","Brian Oconnor","Dom Toretto","Letty Garcia","Jesse"))


df %>%
  mutate(
  Name2 =  case_when(
      trans_id == lag(trans_id) ~ lag(Name)
),
  Name3 = case_when(
  trans_id == lag(trans_id, 2) ~ lag(Name, 2)
),
  Name4 = case_when(
  trans_id == lag(trans_id, 3) ~ lag(Name, 3)
))
  • Please specify one language? It's currently too broad? **I received data in a csv. from a report pulled from Salesforce** What have you tried since then? – NelsonGon Jul 09 '19 at 16:43
  • see the post [here](https://stackoverflow.com/a/12495471/5874001) on how to get the duplicate indexes and filter out the rows of the duplications afterwards. – InfiniteFlash Jul 09 '19 at 17:43

2 Answers2

1

That would work with nest and map functions, df:

df <- tibble(Trans_ID = c(1,2,3,3,4,4,4,4), 
             Name = c("Dom Toretto", "Brian Oconnor", "Letty Garcia", 
                      "Mia Toretto", "Brian Oconnor", "Dom Toretto", 
                      "Letty Garcia", "Jesse"))

Then nest on Trans ID and map over each dataframe:

 df %>% 
  nest(-Trans_ID) %>% 
  mutate(
    data = map(data, ~ mutate(.x, 
                              col_name = str_c("Name", row_number())) %>% 
                 spread("col_name", "Name"))
  ) %>% 
  unnest()

Returns:

  Trans_ID Name1         Name2       Name3        Name4
     <dbl> <chr>         <chr>       <chr>        <chr>
1        1 Dom Toretto   NA          NA           NA   
2        2 Brian Oconnor NA          NA           NA   
3        3 Letty Garcia  Mia Toretto NA           NA   
4        4 Brian Oconnor Dom Toretto Letty Garcia Jesse

Hope this helps!

tvdo
  • 151
  • 3
0

You have a situation here where each row has a different number of entries. That tells me that most likely you don't want to store this data in a dataframe or matrix-like object. Those objects are primarily for square data, which yours is not. I don't know your use case, but I would probably store this data in a list of character vectors. You can get this using split()

output <- split(df$Name,df$Trans_ID)

If you have reason to store this in a square data object with missing values, there are lots of ways to convert the above. For example, using data table.

library(data.table)
squareoutput <- rbindlist(lapply(output,function(x)data.table(x)),fill=TRUE)
farnsy
  • 2,282
  • 19
  • 22