2

I have a huge file like this:

library(tidyverse)

test <- structure(list(one = c(5014, 5014, 5014, 5033, 5033, 5033, 5040, 
5040, 5040, 5171, 5171, 5171, 5174, 5174, 5174, 5183, 5183, 5183, 
5193, 5193, 5193, 5304, 5304, 5304), two = c(5033, 5040, 5304, 
5014, 5040, 5304, 5014, 5033, 5304, 5174, 5183, 5331, 5171, 5183, 
5331, 5171, 5174, 5331, 5161, 1538, 5190, 5014, 5033, 5040)), row.names = c(NA, 
-24L), class = c("tbl_df", "tbl", "data.frame"))

The numbers in the two columns match to other ones in a (recursive?) way (sorry, I don't know how to explain it better). In this example example, it is obvious (for humans, not for computers) that there are only three groups. How can I separate these groups?

All I need is something like that:

group number
1 5014
1 5033
1 5040
1 5304
2 5171
2 5174
2 5183
2 5331
3 5193
3 5161
3 1538
3 5190

I guess it is possible with multiple self-joins, but this seems to be extremely tedious...

Maël
  • 45,206
  • 3
  • 29
  • 67
Patrick
  • 63
  • 5

2 Answers2

4

With igraph:

library(igraph)
graph_from_data_frame(test) |>
  components() |>
  getElement("membership") |>
  stack() |>
  arrange(values)

output

   values  ind
1       1 5014
2       1 5033
3       1 5040
4       1 5304
5       2 5171
6       2 5174
7       2 5183
8       2 5331
9       3 5193
10      3 5161
11      3 1538
12      3 5190
Maël
  • 45,206
  • 3
  • 29
  • 67
1

This is a bit convoluted, but it works:

library(tidyverse)

test |> 
  group_by(one) |> 
  mutate(n = cur_group_id()) |>
  unite("z", one:two, na.rm = TRUE, remove = T) |> 
  separate_rows(z) |> 
  distinct(z,.keep_all = T) |> 
  group_by(n) |> 
  mutate(n = cur_group_id())
#> # A tibble: 12 × 2
#> # Groups:   n [3]
#>    z         n
#>    <chr> <int>
#>  1 5014      1
#>  2 5033      1
#>  3 5040      1
#>  4 5304      1
#>  5 5171      2
#>  6 5174      2
#>  7 5183      2
#>  8 5331      2
#>  9 5193      3
#> 10 5161      3
#> 11 1538      3
#> 12 5190      3

ADDED after your comment

The problem is that some functions from dplyr might be masked by functions from other packages. One way to overcome this problem is to use dplyr:: before the functions name like so: dplyr::mutate(n = dplyr::cur_group_id()).

The other option is to detach the tidyverse and load the required packages only, here it is dplyr and tidy. That is what I have done with in this case. You see that the code works for both test-DF.

library(dplyr)
library(tidyr)

test2 <- structure(list(one = c(7600, 7606, 7600, 7600, 7600, 8224, 8224, 8228), two = c(1142, 1142, 1162, 1174, 1186, 4942, 4952, 4952)), row.names = c(NA, -8L), class = c("tbl_df", "tbl", "data.frame"))

test2 |>
  group_by(one) |>
  mutate(n = cur_group_id()) |>
  unite("z", one:two, na.rm = TRUE, remove = T) |>
  separate_rows(z) |>
  distinct(z, .keep_all = T) |>
  group_by(n) |>
  mutate(n = dplyr::cur_group_id())
#> # A tibble: 10 × 2
#> # Groups:   n [4]
#>    z         n
#>    <chr> <int>
#>  1 7600      1
#>  2 1142      1
#>  3 7606      2
#>  4 1162      1
#>  5 1174      1
#>  6 1186      1
#>  7 8224      3
#>  8 4942      3
#>  9 4952      3
#> 10 8228      4
MarBlo
  • 4,195
  • 1
  • 13
  • 27
  • Interesting. I tried another sample data where it fails (there are only two groups, maybe this is a special case?), while the solution with igraph works: test2 <- structure(list(one = c(7600, 7606, 7600, 7600, 7600, 8224, 8224, 8228), two = c(1142, 1142, 1162, 1174, 1186, 4942, 4952, 4952 )), row.names = c(NA, -8L), class = c("tbl_df", "tbl", "data.frame" )) – Patrick Dec 13 '22 at 19:49
  • @Patrick I have also posted my answer at the reference mentioned above. https://stackoverflow.com/questions/12135971/identify-groups-of-linked-episodes-which-chain-together/74791642#74791642 I think it is good idea, to have all possible answers at one place. You might have a look there and judge upon the answer. – MarBlo Dec 13 '22 at 22:42
  • I like your tidyverse-approach. But your version misses the chain between the numbers 4992-8224-4952-8228. The output of igraph is different: values ind 1 1 7600 2 1 7606 3 1 1142 4 1 1162 5 1 1174 6 1 1186 7 2 8224 8 2 8228 9 2 4942 10 2 4952 – Patrick Dec 14 '22 at 01:20
  • @Patrick you could just do `arrange(n)`. – MarBlo Dec 14 '22 at 01:26
  • It's not the order. Please compare the output of igraph and your version. You create 4 groups, but there should be only 2 groups... – Patrick Dec 14 '22 at 12:39