0

I have a dataframe with two columns: company name and tags in a long format. There are a variable number of tags attached to each company, resulting in about 80k rows:

Company   Tags

Company A Tag A
Company A Tag B
Company A Tag C
Company B Tag A
Company B Tag B
Company B Tag C
Company B Tag D
Company B Tag E
Company B Tag F
Company C Tag A
Company C Tag B
Company C Tag C
Company C Tag D

I want to turn it into a wide format:

Company   Tag 1   Tag 2   Tag 3   Tag 4    Tag 5   Tag 6 

Company A Tag A   Tag B   Tag C
Company B Tag A   Tag B   Tag C   Tag D    Tag E   Tag F
Company C Tag A   Tag B   Tag C   Tag D

Spread doesn't work, because it's expecting me to pass it a column that will become the column names in wide format, but I don't have one. So I can't spread it as is. It seems like I have two options:

  1. Create another column with numbers attached to each company to serve as column names in wide format. But I don't know how to do that in code for every company.

  2. Find a package with a function that can cast the dataframe wide more flexibly than Spread can. Splitstackshape does this well for turning dataframes into long format, but not the other way around.

Any advice would be appreciated! Also, I'd love to learn how to format these tables better without having to do a bunch of manual tabs/spacing.

pez
  • 7
  • 6
  • Option 1 is the way to go. – Mikko Marttila Feb 18 '18 at 19:45
  • [Transpose / reshape dataframe without “timevar” from long to wide format](https://stackoverflow.com/questions/11322801/transpose-reshape-dataframe-without-timevar-from-long-to-wide-format) – Henrik Feb 18 '18 at 19:53
  • Thank you, Mikko! I was envisioning a solution like this but didn't know how to implement it. I'm a little surprised row_number() works with group_by - it's not like the row numbers of the larger dataframe are changing. – pez Feb 19 '18 at 09:43
  • @pez you can think of `group_by()` dividing your data into independent subsets that all other verbs will then operate on as if they were separate data frames. In that context, the `row_number()` behaviour makes perfect sense. – Mikko Marttila Feb 20 '18 at 08:09

4 Answers4

4

Your option 1 is the way to go; but given just the information you've provided, there's no way to tell what column the values should go to (are the Tag <num> columns created based on the order or the value of the original Tags column?). However, if we assume the new columns are created from the ordering, you can just group_by the Company value and use row_number() to create a variable to serve as the desired column names after spreading to wide format.

library(tidyverse)

df <- read.table(
  header = T,
  sep = ",",
  text = '
Company,Tags
Company A,Tag A
Company A,Tag B
Company A,Tag C
Company B,Tag A
Company B,Tag B
Company B,Tag C
Company B,Tag D
Company B,Tag E
Company B,Tag F
Company C,Tag A
Company C,Tag B
Company C,Tag C
Company C,Tag D')

df %>% 
  group_by(Company) %>% 
  mutate(tag_column = str_c("Tag", row_number())) %>% 
  spread(tag_column, Tags)
#> # A tibble: 3 x 7
#> # Groups:   Company [3]
#>   Company   Tag1  Tag2  Tag3  Tag4  Tag5  Tag6 
#>   <fct>     <fct> <fct> <fct> <fct> <fct> <fct>
#> 1 Company A Tag A Tag B Tag C <NA>  <NA>  <NA> 
#> 2 Company B Tag A Tag B Tag C Tag D Tag E Tag F
#> 3 Company C Tag A Tag B Tag C Tag D <NA>  <NA>

Created on 2018-02-18 by the reprex package (v0.2.0).

Mikko Marttila
  • 10,972
  • 18
  • 31
2

for your example, you could consider just using table, which produces a somewhat similar result

table(df)
#            Tags
# Company     Tag A Tag B Tag C Tag D Tag E Tag F
# Company A     1     1     1     0     0     0
# Company B     1     1     1     1     1     1
# Company C     1     1     1     1     0     0

If you now want each value to represent the column name, you could run

tb <- table(df)
z <- which(tb==1, arr.ind=T)
tb[z]<-colnames(tb)[z[,2]]
tb
 #         Tags
 # Company    Tag A Tag B Tag C Tag D Tag E Tag F
 #  Company A Tag A Tag B Tag C 0     0     0    
 #  Company B Tag A Tag B Tag C Tag D Tag E Tag F
 #  Company C Tag A Tag B Tag C Tag D 0     0    
Daniel
  • 2,207
  • 1
  • 11
  • 15
2

Another simpler option could be to use dcast from reshape2. If OP is interested to rename the columns after transformation then it would be easier as well. One has to just create a vector to have different names for columns Tag A to Tag F

library(reshape2)
> dcast(df, Company ~ Tags)
#    Company Tag A Tag B Tag C Tag D Tag E Tag F
#1 Company A Tag A Tag B Tag C  <NA>  <NA>  <NA>
#2 Company B Tag A Tag B Tag C Tag D Tag E Tag F
#3 Company C Tag A Tag B Tag C Tag D  <NA>  <NA>


# Data
df <- read.table( text = "Company   Tags
'Company A' 'Tag A'
'Company A' 'Tag B'
'Company A' 'Tag C'
'Company B' 'Tag A'
'Company B' 'Tag B'
'Company B' 'Tag C'
'Company B' 'Tag D'
'Company B' 'Tag E'
'Company B' 'Tag F'
'Company C' 'Tag A'
'Company C' 'Tag B'
'Company C' 'Tag C'
'Company C' 'Tag D'", header = T, stringsAsFactor = F)
MKR
  • 19,739
  • 4
  • 23
  • 33
0

How about this?

data <- read.table(text="Company Tag
                   CompanyA TagA
                   CompanyA TagB
                   CompanyA TagC
                   CompanyB TagA
                   CompanyB TagB
                   CompanyB TagC
                   CompanyB TagD
                   CompanyB TagE
                   CompanyB TagF
                   CompanyC TagA
                   CompanyC TagB
                   CompanyC TagC
                   CompanyC TagD
                   ",header=TRUE)

library(reshape2)
d = dcast(data,Company~Tag,value.var = "Tag")
names(d)[grep("^Tag", names(d))] = paste("Tag", 1:(ncol(d)-1), sep = "" )

Output :

   Company Tag1 Tag2 Tag3 Tag4 Tag5 Tag6
1 CompanyA TagA TagB TagC <NA> <NA> <NA>
2 CompanyB TagA TagB TagC TagD TagE TagF
3 CompanyC TagA TagB TagC TagD <NA> <NA>
john
  • 1,026
  • 8
  • 19