I want to categorize a numeric feature of my data (genomic data, but that doesn't matter here). To do this, I took the min and max of this feature, made a sequence with 0.01 steps, then split this sequence into 10 equal groups. I then made a list with the categories (1-10) as names and the corresponding slice of the sequence as each value.
snv_filt$repl_timing <- round(snv_filt$repl_timing, 2)
min = min(snv_filt$repl_timing, na.rm = TRUE)
max = max(snv_filt$repl_timing, na.rm = TRUE)
sequence <- seq(min, max, by = 0.01)
tibble(value = sequence, key = ntile(sequence, 10)) %>%
  group_by_at(vars(-value)) %>% # group by everything other than the value column.
  mutate(row_id = 1:n()) %>% ungroup() %>% # build group index
  spread(key, value) %>% # spread
  dplyr::select(-row_id) -> categories
timing_categories <- list(
  "1" = categories$`1`[!is.na(categories$`1`)],
  "2" = categories$`2`[!is.na(categories$`2`)],
  "3" = categories$`3`[!is.na(categories$`3`)],
  "4" = categories$`4`[!is.na(categories$`4`)],
  "5" = categories$`5`[!is.na(categories$`5`)],
  "6" = categories$`6`[!is.na(categories$`6`)],
  "7" = categories$`7`[!is.na(categories$`7`)],
  "8" = categories$`8`[!is.na(categories$`8`)],
  "9" = categories$`9`[!is.na(categories$`9`)],
  "10" = categories$`10`[!is.na(categories$`10`)]
)
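For reference, the same kind of named list can probably be built in one step with `split()`. This is only a minimal sketch: `lo` and `hi` are hypothetical stand-ins for the real min and max of `snv_filt$repl_timing`, and it assumes `ntile()` from dplyr is used as the grouping key, as in the code above.

```r
library(dplyr)  # for ntile()

# Hypothetical stand-in bounds; the real ones come from snv_filt$repl_timing
lo <- -1
hi <- 1
sequence <- seq(lo, hi, by = 0.01)

# split() groups the values by their ntile label and names the
# list elements after those labels ("1" through "10")
timing_categories <- split(sequence, ntile(sequence, 10))

names(timing_categories)
lengths(timing_categories)
```

This skips the intermediate wide table and the per-column NA filtering, but the slices still contain the values generated by `seq()`.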
Then I tried to categorize:
snv_filt$strand_group <- NA
for (i in seq_along(timing_categories)) {
  snv_filt[which(snv_filt$repl_timing %in% timing_categories[[i]]), "strand_group"] <- names(timing_categories)[i]
  print(names(timing_categories)[i])
}
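For comparison, a 1-10 labelling can also be sketched with base R's `cut()`, which compares each value against interval boundaries instead of testing exact membership in a list of numbers. The vector `repl_timing` below is a hypothetical stand-in for `snv_filt$repl_timing`. Note that this is an assumption about what is wanted: `cut()` with `breaks = 10` makes 10 equal-width intervals over the observed range, which is close to, but not identical to, the `ntile()` slices of the 0.01 sequence.

```r
# Hypothetical stand-in for snv_filt$repl_timing
repl_timing <- c(-1.00, -0.42, 0.00, 0.37, 1.00, NA)

# With a single number for breaks, cut() divides the observed range
# into that many equal-width intervals; labels = FALSE returns the
# interval index (1-10) instead of a factor, and NA input stays NA
strand_group <- cut(repl_timing, breaks = 10, labels = FALSE)

strand_group
```

With this approach there is no lookup list to maintain, and values that do not sit exactly on the 0.01 grid still get a group.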
Surprisingly, there were a lot of NAs in the new column... When I checked some values, for example -0.42, I got this:
> timing_categories$"3"
[1] -0.58 -0.57 -0.56 -0.55 -0.54 -0.53 -0.52 -0.51 -0.50 -0.49 -0.48 -0.47 -0.46 -0.45 -0.44 -0.43 -0.42 -0.41 -0.40 -0.39
> -0.42 %in% timing_categories$"3"
[1] FALSE
What the heck? Is it some weird numeric data-sorting stuff I don't know about, or what? I would appreciate it if you could help me.
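A smaller case seems to show the same behaviour without my data, which suggests floating-point representation rather than sorting. The sketch below assumes that `seq()` builds each element as `from + k * by`, so an element can differ in its last binary digits from the literal that prints the same way. The helper `key()` is hypothetical and only illustrates one possible workaround (comparing integer hundredths instead of doubles).

```r
s <- seq(0, 1, by = 0.1)

print(s[4])                    # displays 0.3 at default print precision
s[4] == 0.3                    # FALSE
0.3 %in% s                     # FALSE, same symptom as -0.42 above
sprintf("%.17f", s[4])         # shows the digits print() normally hides
isTRUE(all.equal(s[4], 0.3))   # TRUE: equal within numeric tolerance

# Hypothetical helper: convert to integer hundredths so that
# membership is tested on exact integers rather than doubles
key <- function(x) as.integer(round(x * 100))
key(0.3) %in% key(s)           # TRUE
```

If this is indeed the cause, then `round(snv_filt$repl_timing, 2)` and the `seq()` values can print identically yet fail `%in%`, because `%in%` tests exact equality.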