How do I convert a data frame where some rows are duplicates except for one column

Question

I have data on taxpayers which have multiple branches which produces multiple rows which are almost duplicates except for one relevant column (enterprise activity). I want to change this so that each taxpayer has only one row, which involves creating columns 'enterprise_activity_1', 'enterprise_activity_2' and so on.

I realise this is similar to reshaping but I can't think of a way to use tidyr::spread to achieve this.

For simplicity, say we just have a dataframe like:

df <- tibble::tibble(
  TAXPAYER_ID = c(100, 151, 250, 250, 267, 296, 296, 304),
  ENTERPRISE_ACTIVITY = rep(c("AGRICULTURE", "MANUFACTURING"), 4)
)

What I would like to achieve is this:

TAXPAYER_ID ENTERPRISE_ACTIVITY_1   ENTERPRISE_ACTIVITY_2
100         AGRICULTURE             NA
151         MANUFACTURING           NA
250         AGRICULTURE             MANUFACTURING
267         AGRICULTURE             NA
296         MANUFACTURING           AGRICULTURE
304         MANUFACTURING           NA

My actual data has varying numbers of branches per taxpayer so the number of columns should be the maximum number of branches that one taxpayer has.

Thanks for pointing to that question but it's not quite the same as my data does not have a column that provides an index for each branch within a taxpayer. But if I could create that index then I could use tidyr::spread straightforwardly as in that question. — Jonathan Aron, May 29 '19 at 12:10

score 0 · Accepted Answer · answered May 29 '19 at 12:10

Basically you need to group by the taxpayer ID, create a column to handle duplicated identifiers and spread, i.e.

library(tidyverse)

df %>% 
  group_by(TAXPAYER_ID) %>% 
  mutate(ENTERPRISE_ACT = row_number()) %>% 
  spread(key = ENTERPRISE_ACT, ENTERPRISE_ACTIVITY, sep = '_')

# A tibble: 6 x 3
# Groups:   TAXPAYER_ID [6]
#  TAXPAYER_ID ENTERPRISE_ACT_1 ENTERPRISE_ACT_2
#        <dbl> <chr>            <chr>           
#1         100 AGRICULTURE      <NA>            
#2         151 MANUFACTURING    <NA>            
#3         250 AGRICULTURE      MANUFACTURING   
#4         267 AGRICULTURE      <NA>            
#5         296 MANUFACTURING    AGRICULTURE     
#6         304 MANUFACTURING    <NA>

How do I convert a data frame where some rows are duplicates except for one column

1 Answers1