Group similar strings with numbers and keep order of first appearance

Question

I have a dataframe which looks like this example (just much larger):

var <- c('Peter','Ben','Mary','Peter.1','Ben.1','Mary.1','Peter.2','Ben.2','Mary.2')
v1 <- c(0.4, 0.6, 0.7, 0.3, 0.9, 0.2, 0.4, 0.6, 0.7)
v2 <- c(0.5, 0.4, 0.2, 0.5, 0.4, 0.2, 0.1, 0.4, 0.2)
df <- data.frame(var, v1, v2)

      var  v1  v2
1   Peter 0.4 0.5
2     Ben 0.6 0.4
3    Mary 0.7 0.2
4 Peter.1 0.3 0.5
5   Ben.1 0.9 0.4
6  Mary.1 0.2 0.2
7 Peter.2 0.4 0.1
8   Ben.2 0.6 0.4
9  Mary.2 0.7 0.2

I want to group the strings in 'var' according to the names without the suffixes, and keep the original order of first appearance. Desired output:

      var  v1  v2
1 Peter   0.4 0.5 # Peter appears first in the original data
2 Peter.1 0.3 0.5
3 Peter.2 0.4 0.1
4     Ben 0.6 0.4 # Ben appears second in the original data
5   Ben.1 0.9 0.4
6   Ben.2 0.6 0.4
7    Mary 0.7 0.2 # Mary appears third in the original data
8  Mary.1 0.2 0.2
9  Mary.2 0.7 0.2

How can I achieve that?

Thank you!

Do you mean `dplyr::arrange(df, var)` or `df[order(df$var), ]`? — Maël, Sep 05 '22 at 16:55
If the order of unique elements is important `df %>% mutate(var1 = str_remove(var, "\\.\\d+$")) %>% arrange(factor(var1, levels = unique(var1))) %>% select(-var1)` — akrun, Sep 05 '22 at 16:59
This solves the problem for the example I provided, but in my original datastet the numbers behind the '.' range from 1 to 150, so that the result looks like this: 'Peter, Peter1, Peter10, Peter100, ...', however I want to have an consecutive order of the numbers. I will edit my question so it is clearer. @Maël — Mina, Sep 05 '22 at 17:03

score 2 · Accepted Answer · answered Sep 05 '22 at 17:08

An option is to create a temporary column without the . and the digits (\\d+) at the end with str_remove, then use factor with levels specified as the unique values or use match to arrange the data

library(dplyr)
library(stringr)
df <- df %>%
   mutate(var1 = str_remove(var, "\\.\\d+$")) %>% 
   arrange(factor(var1, levels = unique(var1))) %>%
   select(-var1)

Or use fct_inorder from forcats which will convert to factor with levels in the order of first appearance

library(forcats)
df %>% 
   arrange(fct_inorder(str_remove(var, "\\.\\d+$")))

-output

     var  v1  v2
1   Peter 0.4 0.5
2 Peter.1 0.3 0.5
3 Peter.2 0.4 0.1
4     Ben 0.6 0.4
5   Ben.1 0.9 0.4
6   Ben.2 0.6 0.4
7    Mary 0.7 0.2
8  Mary.1 0.2 0.2
9  Mary.2 0.7 0.2

Henrik · Answer 2 · 2022-09-05T23:04:23.143

2

Compact option with sub and data.table::chgroup

df[chgroup(sub("\\..", "", df$var)),]

      var  v1  v2
1   Peter 0.4 0.5
4 Peter.1 0.3 0.5
7 Peter.2 0.4 0.1
2     Ben 0.6 0.4
5   Ben.1 0.9 0.4
8   Ben.2 0.6 0.4
3    Mary 0.7 0.2
6  Mary.1 0.2 0.2
9  Mary.2 0.7 0.2

chgroup groups together duplicated values but retains the group order (according the first appearance order of each group), efficiently

edited Sep 05 '22 at 23:04

answered Sep 05 '22 at 19:57

Henrik

65,555
14
143
159

G. Grothendieck · Answer 3 · 2022-09-06T00:00:48.497

separate the var column into two columns, replace the NAs that get generated with 0, sort and remove the extra columns.

This works on the numeric value of the numbers rather than the character representation so that for example, 10 won't come before 2. Also, the match in arrange ensures that the order is based on the first occurrence order.

df %>%
  separate(var, c("alpha", "no"), convert=TRUE, remove=FALSE, fill="right") %>% 
  mutate(no = replace_na(no, 0)) %>%
  arrange(match(alpha, alpha), no) %>%
  select(-alpha, -no)

giving

      var  v1  v2
1   Peter 0.4 0.5
2 Peter.1 0.3 0.5
3 Peter.2 0.4 0.1
4     Ben 0.6 0.4
5   Ben.1 0.9 0.4
6   Ben.2 0.6 0.4
7    Mary 0.7 0.2
8  Mary.1 0.2 0.2
9  Mary.2 0.7 0.2

Update

Have removed what was previously the first solution after reading the update to the question.

score 1 · Answer 4 · answered Sep 05 '22 at 18:31

If you don't mind that the values in var are ordered alphabetically, then the simplest solution is this:

df %>%
  arrange(var)
      var  v1  v2
1     Ben 0.6 0.4
2   Ben.1 0.9 0.4
3   Ben.2 0.6 0.4
4    Mary 0.7 0.2
5  Mary.1 0.2 0.2
6  Mary.2 0.7 0.2
7   Peter 0.4 0.5
8 Peter.1 0.3 0.5
9 Peter.2 0.4 0.1

Group similar strings with numbers and keep order of first appearance

4 Answers4

Update