Consider a data frame with two numeric columns and a categorical column containing a string:

d1 <- data.frame(x = c(0, 1, 2, 5, 6.5, 8), y = c(0, 2, 3, 5, 5.5, 5), category = "ValueA")
d2 <- data.frame(x = c(0, 1, 2, 4, 6, 8), y = c(0, 3, 3.5, 4, 4, 5), category = "ValueB")
df <- rbind(d1, d2)
> df
     x   y category
1  0.0 0.0   ValueA
2  1.0 2.0   ValueA
3  2.0 3.0   ValueA
4  5.0 5.0   ValueA
5  6.5 5.5   ValueA
6  8.0 5.0   ValueA
7  0.0 0.0   ValueB
8  1.0 3.0   ValueB
9  2.0 3.5   ValueB
10 4.0 4.0   ValueB
11 6.0 4.0   ValueB
12 8.0 5.0   ValueB

I want to prepend a number to the values of the category column, with the number increasing sequentially for each distinct categorical value ("ValueA", "ValueB", ...). My attempt using dplyr:

library(dplyr)
categories <- unique(df$category)  # renamed from diff, which masks base::diff()
for (i in seq_along(categories)) {
  # paste0() has no sep argument, so "." is passed as an ordinary string
  chunk <- subset(df, category == categories[i]) %>%
    mutate(category = paste0(i, ".", categories[i]))
  if (i == 1) {
    results.df <- chunk
  } else {
    results.df <- rbind(results.df, chunk)
  }
}
> results.df
     x   y category
1  0.0 0.0 1.ValueA
2  1.0 2.0 1.ValueA
3  2.0 3.0 1.ValueA
4  5.0 5.0 1.ValueA
5  6.5 5.5 1.ValueA
6  8.0 5.0 1.ValueA
7  0.0 0.0 2.ValueB
8  1.0 3.0 2.ValueB
9  2.0 3.5 2.ValueB
10 4.0 4.0 2.ValueB
11 6.0 4.0 2.ValueB
12 8.0 5.0 2.ValueB

This works fine, but are there any better approaches? Making a data.frame for each distinct category string (like I'm doing within my loop) seems overkill, especially when I would be dealing with a large number of unique values in category. (I'm using two here for a minimal example!)

I'm pretty sure there are better ways to modify the string values directly (perhaps operations within the data frame?), but I'm lacking this knowledge. Any answers/pointers would be appreciated!

1 Answer

You should not need for and if to achieve this in R; a grouped, vectorized operation handles it in one step.
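As a rough illustration of the vectorized idea without any extra packages, here is a minimal base R sketch. It assumes the groups should be numbered in order of first appearance; the names `category_id` and `labelled` are made up for this example.

```r
# Rebuild the example data from the question.
d1 <- data.frame(x = c(0, 1, 2, 5, 6.5, 8), y = c(0, 2, 3, 5, 5.5, 5), category = "ValueA")
d2 <- data.frame(x = c(0, 1, 2, 4, 6, 8), y = c(0, 3, 3.5, 4, 4, 5), category = "ValueB")
df <- rbind(d1, d2)

# match() gives, for every row, the position of its category among the
# unique values, i.e. a group number in order of first appearance.
category_id <- match(df$category, unique(df$category))

# Work on a copy so the original df stays untouched.
labelled <- df
labelled$category <- paste0(category_id, ".", df$category)
```

No per-category data frames are built here, so the cost should not grow with the number of distinct values the way the rbind loop does.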

Here is an option with data.table:

library(data.table)
setDT(df)
df[, category.id := .GRP, by = category]  # .GRP is the group counter, in order of first appearance
df[, category.label := paste0(category.id, ".", category)]

Result:

      x   y category category.id category.label
 1: 0.0 0.0   ValueA           1       1.ValueA
 2: 1.0 2.0   ValueA           1       1.ValueA
 3: 2.0 3.0   ValueA           1       1.ValueA
 4: 5.0 5.0   ValueA           1       1.ValueA
 5: 6.5 5.5   ValueA           1       1.ValueA
 6: 8.0 5.0   ValueA           1       1.ValueA
 7: 0.0 0.0   ValueB           2       2.ValueB
 8: 1.0 3.0   ValueB           2       2.ValueB
 9: 2.0 3.5   ValueB           2       2.ValueB
10: 4.0 4.0   ValueB           2       2.ValueB
11: 6.0 4.0   ValueB           2       2.ValueB
12: 8.0 5.0   ValueB           2       2.ValueB
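
Since the question uses dplyr, a similar grouped approach can probably be sketched there as well. This assumes dplyr 1.0.0 or later, where `cur_group_id()` is available; the name `result` is made up for this example. Note that `group_by()` numbers groups in sorted key order rather than order of first appearance, which happens to coincide for "ValueA" and "ValueB".

```r
library(dplyr)

# Rebuild the example data from the question.
d1 <- data.frame(x = c(0, 1, 2, 5, 6.5, 8), y = c(0, 2, 3, 5, 5.5, 5), category = "ValueA")
d2 <- data.frame(x = c(0, 1, 2, 4, 6, 8), y = c(0, 3, 3.5, 4, 4, 5), category = "ValueB")
df <- rbind(d1, d2)

result <- df %>%
  group_by(category) %>%
  # cur_group_id() returns one integer per group; it is recycled over the rows.
  mutate(category.id = cur_group_id(),
         category.label = paste0(category.id, ".", category)) %>%
  ungroup()
```

The label goes into a new column because dplyr does not allow a grouping column to be modified inside `mutate()`.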
Bulat