
I want to categorize a numeric feature of my data (genomic data, but that doesn't matter here). To do this, I took the min and max of the feature, made a sequence between them in steps of 0.01, then split that sequence into 10 equal groups. From this I built a list with the categories (1-10) as names and the corresponding slice of the sequence as each value.

snv_filt$repl_timing <- round(snv_filt$repl_timing, 2)

min = min(snv_filt$repl_timing, na.rm = TRUE)
max = max(snv_filt$repl_timing, na.rm = TRUE)
sequence <- seq(min, max, by = 0.01)

tibble(value = sequence, key = ntile(sequence, 10)) %>%
  group_by_at(vars(-value)) %>%  # group by everything other than the value column. 
  mutate(row_id=1:n()) %>% ungroup() %>%  # build group index
  spread(key, value) %>%    # spread
  dplyr::select(-row_id) -> categories

timing_categories <- list(
  "1" = categories$`1`[!is.na(categories$`1`)],
  "2" = categories$`2`[!is.na(categories$`2`)],
  "3" = categories$`3`[!is.na(categories$`3`)],
  "4" = categories$`4`[!is.na(categories$`4`)],
  "5" = categories$`5`[!is.na(categories$`5`)],
  "6" = categories$`6`[!is.na(categories$`6`)],
  "7" = categories$`7`[!is.na(categories$`7`)],
  "8" = categories$`8`[!is.na(categories$`8`)],
  "9" = categories$`9`[!is.na(categories$`9`)],
  "10" = categories$`10`[!is.na(categories$`10`)] 
  )

Then I tried to categorize:

snv_filt$strand_group <- NA
for (i in 1:length(timing_categories)) {
  snv_filt[which(snv_filt$repl_timing %in% timing_categories[[i]]), "strand_group"] <- names(timing_categories)[i]  
  print(names(timing_categories)[i])
}

Surprisingly, there were a lot of NAs in the new column. When I checked some values, for example -0.42, I got this:

> timing_categories$"3"

[1] -0.58 -0.57 -0.56 -0.55 -0.54 -0.53 -0.52 -0.51 -0.50 -0.49 -0.48 -0.47 -0.46 -0.45 -0.44 -0.43 -0.42 -0.41 -0.40 -0.39

> -0.42 %in% timing_categories$"3"

[1] FALSE

What the heck? Is it some weird numeric data-sorting stuff I don't know or what? I would appreciate if you could help me.

zsoltgy
  • Surprisingly, computers aren't very good at counting with floating point numbers. Some decimal values cannot be represented accurately in binary values. Doing exact tests of equality with floating point numbers is generally not a good idea. Often you get better accuracy with something like `sequence <- seq(min*100, max*100, by = 1)/100` – MrFlick Feb 03 '23 at 14:51
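To make the comment above concrete, here is a minimal sketch in base R. The `repl_timing` vector is a hypothetical stand-in for `round(snv_filt$repl_timing, 2)`, and the three options are suggestions under that assumption, not a tested fix for the original data. Note that `cut()` makes equal-width bins from the data range, so its group boundaries may differ slightly from the `ntile()` slices in the question.

```r
# Hypothetical stand-in for round(snv_filt$repl_timing, 2)
repl_timing <- round(c(-1.00, -0.58, -0.42, -0.39, 0.07, 0.58, 1.00), 2)

# Why exact matching is fragile: seq() builds values as from + k * by,
# so the binary rounding error can differ from that of the literal.
grid_float <- seq(0, 1, by = 0.1)
float_match <- grid_float[4] == 0.3   # FALSE on IEEE 754 doubles

# Option 1 (as in the comment): step in whole numbers, divide once at the end.
lo <- min(repl_timing, na.rm = TRUE)
hi <- max(repl_timing, na.rm = TRUE)
grid_int <- seq(round(lo * 100), round(hi * 100), by = 1) / 100

# Option 2: compare integer keys instead of doubles.
key_data <- as.integer(round(repl_timing * 100))
key_grid <- as.integer(round(grid_int * 100))
in_grid <- key_data %in% key_grid

# Option 3: skip the lookup list and bin directly into 10 equal-width groups.
strand_group <- cut(repl_timing, breaks = 10, labels = FALSE)
```

If only the group number is needed, option 3 is probably the simplest, since it never tests floating point values for equality. When a tolerance-based comparison is unavoidable, `isTRUE(all.equal(a, b))` is the usual base R idiom instead of `==`.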
