1

I have a large dataset (one million rows and ~ 300 columns) in wide format. The dataset contains different metrics such as revenue, cost etc. for multiple products. Unfortunately the dataset comes in wide format. So the variables like revenue or costs are not a single column. Instead there is a column for each products revenue/ costs etc.

For example the columns are called "product1_revenue", "product2_revenue", "product1_costs", "product2_costs" and so on.

I want to transform the dataset into to long format, so I can properly work with it.

I can achieve the transformation for one variable "total_revenue". This works (except the fact, that I cannot keep the id) but I want to this for all other metrics as well.

  select(ends_with("_total_revenue")) %>%
  gather(key=product,value="total_revenue") %>%
  mutate(product=str_replace(product,"_total_revenue",""))

### Trying to keep the IDs does not work:
dataset %>%
  select(ends_with("_total_revenue"),id) %>%
  gather(key=product,value="total_revenue") %>%
  mutate(product=str_replace(product,"_total_revenue",""))

### I want something like this (if it would work of course)

i<-c("_total_revenue","_total_cost")
for(ends_with(colnames(dataset),i) in i) 
{
dataset %>%
  select(ends_with(!!i),id) %>%
  gather(key=product,value=!!i) %>%
  mutate(product=str_replace(product,!!i,""))
  print(i)
}
KevR
  • 131
  • 7

1 Answers1

0

Assuming you have data something like this :

df <- data.frame(id = 1:5, product1_cost  = 11:15, product2_cost = 16:20, 
                 product1_revenue = 21:25, product2_revenue = 26:30)
df
#  id product1_cost product2_cost product1_revenue product2_revenue
#1  1            11            16               21               26
#2  2            12            17               22               27
#3  3            13            18               23               28
#4  4            14            19               24               29
#5  5            15            20               25               30

We could use functions from tidyr library to convert the data in wide format.

library(tidyr)

df %>%
  gather(key, value, -id) %>%
  separate(key, into = c("product", "key"), sep = "_") %>%
  spread(key, value)

#   id  product cost revenue
#1   1 product1   11      21
#2   1 product2   16      26
#3   2 product1   12      22
#4   2 product2   17      27
#5   3 product1   13      23
#6   3 product2   18      28
#7   4 product1   14      24
#8   4 product2   19      29
#9   5 product1   15      25
#10  5 product2   20      30
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • Hi @ronak shah, thanks for the quick answer. Running your code results in the following error message: Error: Each row of output must be identified by a unique combination of keys. It seems like this is a quite common problem? – KevR Aug 24 '19 at 10:51
  • @KevinR yes, but the solution depends on how your data is structured. My guess would be to try `df %>% gather(key, value, -id) %>% separate(key, into = c("product", "key"), sep = "_") %>% group_by(id) %>% mutate(row = row_number()) %>% spread(key, value)` – Ronak Shah Aug 24 '19 at 10:57
  • Hi @ronak shah, this seems to work. But what I do not understand is that I receive a lot of NAs running this code. I checked for NAs and the dataset does not contain any missing values. – KevR Aug 24 '19 at 11:19
  • @KevinR `NA`s could be present in the data if there are some unequal values for certain level. Did you check the values are they being converted correctly in long format or there is some mismatch ? – Ronak Shah Aug 24 '19 at 11:57
  • actually the link to the other question and the answers given there solved my problem – KevR Aug 26 '19 at 15:28