I have a data frame of depths for a number of different stations in a lake, and I want every station to have a complete sequence of depths from the min to the max (and missing values filled with NAs).

I am using tidyr::complete to do this, but it is behaving oddly. When my depths are rounded to zero decimal places, the code runs as expected, but when the data are to the tenth of a meter, something odd happens to the class of depth and some combinations are completed (and values filled with NA) even though I already have data for that depth.

Has anyone experienced this before? I assume it has something to do with the class of depth, but I haven't quite figured it out or how to avoid it.

library(dplyr)

b <- data.frame(site = c(rep("A", 10), rep("B", 10)),
                depth = c(seq(0.1, 0.8, 0.1), 1.0, 1.1, seq(0.3, 0.5, 0.1), seq(0.9, 1.5, 0.1)),
                value = round(runif(20, 0, 5), 1))

b2 <- b %>% 
  mutate(site = factor(site)) %>% 
  group_by(site) %>% 
  tidyr::complete(depth = seq(min(depth),
                              max(depth),
                              by = 0.1)) %>% 
  arrange(site, depth)

Some depths from the original data frame are duplicated, which is unexpected.

class(b2$depth) 
unique(b2$depth)
b2[b2$site == "B", ] 

When I convert out of numeric and back to numeric, depth seems to have reverted to what I would expect, although I still need to remove the duplicated depths with NAs.

class(as.numeric(as.character(b2$depth))) 
unique(as.numeric(as.character(b2$depth)))

If depths have no decimal places, the behaviour seems more predictable.

a <- data.frame(site = c(rep("A", 10), rep("B", 10)),
                depth = c(1:4, 6:11, 3:5, 8, 10, 12:16),
                value = round(runif(20, 0, 5), 1))

a2 <- a %>% 
  mutate(site = factor(site)) %>% 
  group_by(site) %>% 
  tidyr::complete(depth = seq(min(depth),
                              max(depth),
                              by = 1)) %>% 
  arrange(site, depth)

class(a2$depth)
unique(a2$depth)
a2[a2$site == "B", ] 
  • You may want to review how `complete()` works. As you have it, `complete()` sets "depth" to a sequence of numbers from min to max, increasing by 0.1, by group. This creates new rows for the numbers that didn't exist before. It then fills with NA all empty cells that were introduced by the new sequence. "b2$depth" is a number (the sequence) and the class is as expected. It is in order too. Perhaps you can edit your question to include what you expect. I'm thinking `complete()` may not be what you're looking for. – guasi Jul 08 '22 at 19:52
  • I guess you want the NAs to appear in the value column, not among the depths? In that case specify `fill = list(value = NA)` (**value**, not **depth**). Does that solve the issue? –  Jul 08 '22 at 20:03
  • @I_O The NAs are appearing in the value, not the depth. Your question made me realize that they are being introduced not as a result of `fill` but automatically, as a result of the new observations introduced by the sequence. – guasi Jul 08 '22 at 20:46
  • @guasi what you are describing is indeed what I want. I realized that the fill(depth = NA) command was redundant (@I_O tipped me off) and I have removed it from the code. I don't agree with you that b2$depth is a number and that it's in order. When I look at the column using unique() there are repeated numbers. And when I look at site B, depth 1.2, I don't understand why there are two rows for that site/depth combo. Your second comment is making me wonder if the seq() portion is adding them. If I run a similar experiment with depths as whole numbers, these issues don't appear. – Annika Elsie Jul 08 '22 at 22:39

1 Answer

Indeed, complete() is producing extra unexplained elements.

I think that when complete() matches the numbers produced by the sequence against the numbers in the original data, two values that both print as 1.4 can compare as unequal, so it treats them as two different numbers. I did a small test and indeed got FALSE too! To avoid this, we need to round, so that the "depth" values in the original data frame and those created by seq() are exactly the same.

b <- data.frame(site = c(rep("A", 10), rep("B", 10)),
                depth = c(seq(0.1, 0.8, 0.1), 1.0, 1.1, seq(0.3, 0.5, 0.1), seq(0.9, 1.5, 0.1)),
                value = round(runif(20, 0, 5), 1))

b$depth <- round(b$depth,2) #crucial to do it here and not via mutate()

b2 <- b %>%
  mutate(site = factor(site)) %>%
  group_by(site) %>% 
  tidyr::complete(depth = round(seq(min(depth),
                                max(depth),
                                by = 0.1), 
                                2)) %>% 
  ungroup() 

Output for site B, to check that all is good:

> b2[b2$site == "B",]
# A tibble: 13 × 3
   site  depth value
   <chr> <chr> <dbl>
 1 B     0.3     3.4
 2 B     0.4     0.7
 3 B     0.5     4.5
 4 B     0.6    NA  
 5 B     0.7    NA  
 6 B     0.8    NA  
 7 B     0.9     1.4
 8 B     1       1.2
 9 B     1.1     1.3
10 B     1.2     3.3
11 B     1.3     2.1
12 B     1.4     0.9
13 B     1.5     1.2
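
An alternative to rounding on both sides is to keep the grid in whole numbers, so that no decimal comparison happens at all. The following is only a sketch under my own assumptions: the column name `depth_dm` (depth in tenths of a metre) is made up for illustration, and it assumes every depth is meant to be a multiple of 0.1.

```r
library(dplyr)
library(tidyr)

set.seed(1)  # only so the example is reproducible
b <- data.frame(site = c(rep("A", 10), rep("B", 10)),
                depth = c(seq(0.1, 0.8, 0.1), 1.0, 1.1, seq(0.3, 0.5, 0.1), seq(0.9, 1.5, 0.1)),
                value = round(runif(20, 0, 5), 1))

b2 <- b %>%
  # hypothetical helper column: depth as an integer number of tenths of a metre
  mutate(depth_dm = as.integer(round(depth * 10))) %>%
  select(-depth) %>%
  group_by(site) %>%
  # integer sequences are exact, so existing depths match the grid
  complete(depth_dm = seq(min(depth_dm), max(depth_dm), by = 1L)) %>%
  ungroup() %>%
  # convert back to metres only after the grid has been completed
  mutate(depth = depth_dm / 10)

b2[b2$site == "B", ]
```

Because the matching is done on integers, each site/depth combination should appear only once, and the decimal depth is derived at the very end purely for display.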

The problem, as I understand it, is due to the way seq() stores the values it produces. Notice how the values are not exact:

> v2 = seq(.3,1.4, by = .1)
> print(v2, digits =16)
 [1] 0.3000000000000000 0.4000000000000000 0.5000000000000000 0.6000000000000001
 [5] 0.7000000000000000 0.8000000000000000 0.9000000000000001 1.0000000000000000
 [9] 1.1000000000000001 1.2000000000000000 1.3000000000000000 1.3999999999999999

Because values that both print as 1.4 are not necessarily equal in their last digits, complete() did not join the two 1.4 values. That's how the duplicates got produced.
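
Here is a small base R check of that mismatch, as a sketch rather than a full diagnosis. The names `from_03`, `from_09`, `x` and `y` are mine; the two sequences mirror the grid that complete() builds for site B and the way the original depths for site B were generated.

```r
# Grid that complete() builds for site B: from the min (0.3) to the max (1.5)
from_03 <- seq(0.3, 1.5, by = 0.1)
# How the deeper part of the original site B data was generated
from_09 <- seq(0.9, 1.5, by = 0.1)

x <- from_03[12]  # prints as 1.4
y <- from_09[6]   # prints as 1.4

print(c(x, y), digits = 17)   # shows the difference in the last digits

x == y                        # FALSE: exact comparison fails
isTRUE(all.equal(x, y))       # TRUE: equal within a small tolerance
round(x, 2) == round(y, 2)    # TRUE: rounding both sides makes them identical
```

The exact comparison is what matters for the join inside complete(), which is why rounding both the data and the sequence to the same number of digits removes the duplicates.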

guasi
  • Wow, really interesting. I had no idea seq() behaved this way. Thanks for the great explanation @guasi. – Annika Elsie Jul 10 '22 at 16:31
  • Super interesting--I learned a lot trying to figure out your problem! And I learned the issue is not specific to `seq()`, it's general to R's floating point arithmetic. This was useful https://stackoverflow.com/questions/9508518/why-are-these-numbers-not-equal – guasi Jul 10 '22 at 18:31