0

I want to clean a column of an R dataframe that contains a mixture of values:

chromosome_count
26
54
c.36
28-30
12, 24

so that it looks like this, with comma-separated values split into separate rows and only the minimum value kept where a range is recorded:

chromosome_count
26
54
36
28
12
24

I'm a very stumped beginner and any advice would be very appreciated.

2 Answers

1

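You could use regular expressions. The first alternative, `(?<=\\d)-\\d+`, uses a lookbehind: if a hyphen is preceded by a digit, the hyphen and the digits after it are removed, which deletes the upper end of a range (the `-30` in `28-30`). This solution assumes the range is ordered min-max. The second alternative, `^\\D+`, deletes anything at the start of the string that is not a digit (the `c.` in `c.36`).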

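library(tidyverse)  # provides %>%, mutate(), str_remove() and separate_rows()
df %>%
  # drop the upper end of a range, or any non-digits at the start
  mutate(chromosome_count = str_remove(chromosome_count, "(?<=\\d)-\\d+|^\\D+")) %>%
  separate_rows(chromosome_count, convert = TRUE)  # split "12, 24" into two rows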

# A tibble: 6 x 1
  chromosome_count
             <int>
1               26
2               54
3               36
4               28
5               12
6               24
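
To see what the pattern does on its own, here is a minimal sketch that applies it to a plain character vector instead of a data frame. The vector `x` simply repeats the sample values from the question, and the names `cleaned` and `counts` are made up for illustration. Note that `str_remove()` removes only the first match in each string, which is enough for these values.

```r
library(stringr)

# Sample values from the question, as strings
x <- c("26", "54", "c.36", "28-30", "12, 24")

# Remove the upper end of a range, or leading non-digits (first match only)
cleaned <- str_remove(x, "(?<=\\d)-\\d+|^\\D+")
print(cleaned)

# Split on a comma plus optional whitespace, then flatten to one integer vector
counts <- as.integer(unlist(str_split(cleaned, ",\\s*")))
print(counts)
```

After the first step only `"12, 24"` still holds two values; the split then yields one integer per original count.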
r2evans
Onyambu
-1
require(tidyverse)

df <- tibble(
  chromo = c(26, 
             54, 
             "c.36", 
             "28-30", 
             "12, 24")
) 

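df %>%
  # keep the hyphen but drop everything after it, so "28-30" becomes "28-"
  mutate(chromo = chromo %>%
           str_replace_all("(?<=-).*", "")) %>%
  # split into rows on separators such as ", " or "-"
  separate_rows(chromo, convert = TRUE) %>%
  # strip the remaining non-digits ("c.36" becomes "36") and parse as numbers
  mutate(chromo = chromo %>%
           str_replace_all("[^0-9]", "") %>%
           parse_number()) %>%
  # drop rows that parsed to NA, such as an empty piece after a trailing "-"
  drop_na()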

# A tibble: 6 x 1
  chromo
   <dbl>
1     26
2     54
3     36
4     28
5     12
6     24
Chamkrai
  • require/library: https://stackoverflow.com/a/51263513/3358272, https://yihui.org/en/2014/07/library-vs-require/, https://r-pkgs.org/namespace.html#search-path – r2evans Apr 29 '22 at 23:56
  • The question asks that with a range (`28-30`), only the first/min value should be retained. – r2evans Apr 30 '22 at 16:42
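
Both answers rely on the lower value appearing first in a range. If that cannot be guaranteed, the minimum can be taken explicitly. The following is only a sketch under that assumption: `min_count()` is a hypothetical helper, not part of either answer or of any package, and it treats every run of digits in a piece as a candidate count.

```r
library(stringr)

# Hypothetical helper: smallest integer found in one piece of text,
# or NA if the piece contains no digits at all
min_count <- function(piece) {
  nums <- as.integer(str_extract_all(piece, "\\d+")[[1]])
  if (length(nums) == 0) NA_integer_ else min(nums)
}

# Sample values from the question, as strings
x <- c("26", "54", "c.36", "28-30", "12, 24")

# Split comma-separated entries first, then reduce each piece to its minimum
pieces <- unlist(str_split(x, ","))
counts <- vapply(pieces, min_count, integer(1), USE.NAMES = FALSE)
print(counts)
```

Because the minimum is computed rather than assumed, a reversed range such as `min_count("30-28")` also gives 28.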