Minus values from columns relative to year

Question

I'm trying to minus values for each habitat covariate relative to year 2019 and 2010. So, something that can assign by ID those values belonging to each habitat for 2010 and 2019, minus them, otherwise, those that aren't grouped by ID are left as is in the dataframe.

Here's an example of the dataset and what I expect for the output:

#dataset example

# A tibble: 30 x 18
      id year  pland_00_water pland_01_evergr~ pland_02_evergr~ pland_03_decidu~ pland_04_decidu~ pland_05_mixed_~ pland_06_closed~
   <int> <chr>          <dbl>            <dbl>            <dbl>            <dbl>            <dbl>            <dbl>            <dbl>
 1   267 2019          0.0833                0                0                0                0                0                0
 2   268 2019          0.2                   0                0                0                0                0                0
 3   362 2019          0.1                   0                0                0                0                0                0
 4   420 2019          0.0556                0                0                0                0                0                0
 5   421 2019          0.0667                0                0                0                0                0                0
 6   484 2019          0.125                 0                0                0                0                0                0
 7   492 2010          0.1                   0                0                0                0                0                0
 8   492 2019          0.1                   0                0                0                0                0                0
 9   719 2010          0.0769                0                0                0                0                0                0
10   719 2019          0.0769                0                0                0                0                0                0


#output example

# A tibble: 30 x 18
      id year  pland_00_water pland_01_evergr~ pland_02_evergr~ pland_03_decidu~ pland_04_decidu~ pland_05_mixed_~ pland_06_closed~
   <int> <chr>          <dbl>            <dbl>            <dbl>            <dbl>            <dbl>            <dbl>            <dbl>
 1   267 2019          0.0833                0                0                0                0                0                0
 2   268 2019          0.2                   0                0                0                0                0                0
 3   362 2019          0.1                   0                0                0                0                0                0
 4   420 2019          0.0556                0                0                0                0                0                0
 5   421 2019          0.0667                0                0                0                0                0                0
 6   484 2019          0.125                 0                0                0                0                0                0
 7   492 changed        0                     0                0                0                0                0                0

 9   719 changed        0                     0                0                0                0                0                0

I can imagine this working with a function and boolean operators such that, if year 2010 & 2019 match by id then minus the next row by the previous (assuming that they're ordered by id then this should work), otherwise, if they do not match by id then leave them as is.

I'm trying to wrap my head around which code to use for this, I can see this working within a function and using lapply to apply across the entire dataset.

Here's a reproducible code:

structure(list(id = c(267L, 268L, 362L, 420L, 421L, 484L, 492L, 
492L, 719L, 719L, 986L, 986L, 1071L, 1071L, 1303L, 1303L, 1306L, 
1399L, 1399L, 1400L, 1400L, 2007L, 2083L, 2083L, 2134L, 2135L, 
2136L, 2213L, 2213L, 2214L), year = c(2019, 2019, 2019, 2019, 
2019, 2019, 2010, 2019, 2010, 2019, 2010, 2019, 2010, 2019, 2010, 
2019, 2010, 2010, 2019, 2010, 2019, 2019, 2010, 2019, 2019, 2019, 
2019, 2010, 2019, 2010), pland_00_water = c(0.0833333333333333, 
0.2, 0.1, 0.0555555555555556, 0.0666666666666667, 0.125, 0.1, 
0.1, 0.0769230769230769, 0.0769230769230769, 0.0588235294117647, 
0.0588235294117647, 0.0714285714285714, 0.0714285714285714, 0.0769230769230769, 
0.0769230769230769, 0.0588235294117647, 0.05, 0.05, 0.111111111111111, 
0.111111111111111, 0.0526315789473684, 0.142857142857143, 0.142857142857143, 
0.0666666666666667, 0.0588235294117647, 0.1, 0.142857142857143, 
0.142857142857143, 0.25), pland_01_evergreen_needleleaf = c(0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.0588235294117647, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), pland_02_evergreen_broadleaf = c(0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0), pland_03_deciduous_needleleaf = c(0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.0714285714285714, 0, 0, 
0, 0, 0.05, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), pland_04_deciduous_broadleaf = c(0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.0714285714285714, 0.0714285714285714, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), pland_05_mixed_forest = c(0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0), pland_06_closed_shrubland = c(0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0), pland_07_open_shrubland = c(0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0), pland_08_woody_savanna = c(0, 0, 0, 0, 0, 0, 
0, 0, 0.0769230769230769, 0.0769230769230769, 0.0588235294117647, 
0.0588235294117647, 0.0714285714285714, 0.0714285714285714, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), pland_09_savanna = c(0, 
0, 0, 0, 0, 0, 0, 0, 0.0769230769230769, 0.0769230769230769, 
0.0588235294117647, 0.0588235294117647, 0, 0, 0, 0.0769230769230769, 
0.0588235294117647, 0.05, 0.05, 0.111111111111111, 0.111111111111111, 
0, 0, 0, 0, 0, 0, 0, 0, 0), pland_10_grassland = c(0.0833333333333333, 
0.2, 0.1, 0.0555555555555556, 0.0666666666666667, 0.125, 0.1, 
0.1, 0.0769230769230769, 0.0769230769230769, 0.0588235294117647, 
0.0588235294117647, 0.0714285714285714, 0.0714285714285714, 0.0769230769230769, 
0.0769230769230769, 0.0588235294117647, 0.05, 0.05, 0.111111111111111, 
0.111111111111111, 0.0526315789473684, 0.142857142857143, 0.142857142857143, 
0.0666666666666667, 0.0588235294117647, 0.1, 0.142857142857143, 
0.142857142857143, 0.25), pland_11_wetland = c(0.0833333333333333, 
0.2, 0.1, 0.0555555555555556, 0, 0, 0.1, 0.1, 0.0769230769230769, 
0.0769230769230769, 0.0588235294117647, 0.0588235294117647, 0.0714285714285714, 
0.0714285714285714, 0.0769230769230769, 0.0769230769230769, 0.0588235294117647, 
0.05, 0.05, 0.111111111111111, 0, 0.0526315789473684, 0.142857142857143, 
0.142857142857143, 0.0666666666666667, 0.0588235294117647, 0.1, 
0.142857142857143, 0.142857142857143, 0), pland_12_cropland = c(0.0833333333333333, 
0.2, 0.1, 0.0555555555555556, 0.0666666666666667, 0.125, 0.1, 
0.1, 0.0769230769230769, 0.0769230769230769, 0.0588235294117647, 
0, 0, 0, 0.0769230769230769, 0.0769230769230769, 0.0588235294117647, 
0.05, 0.05, 0.111111111111111, 0.111111111111111, 0.0526315789473684, 
0.142857142857143, 0.142857142857143, 0.0666666666666667, 0, 
0, 0.142857142857143, 0.142857142857143, 0.25), pland_13_urban = c(0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0), pland_14_mosiac = c(0, 0, 0, 0, 0, 0, 
0, 0, 0.0769230769230769, 0.0769230769230769, 0, 0.0588235294117647, 
0, 0, 0, 0, 0, 0.05, 0.05, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), 
    pland_15_barren = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), row.names = c(NA, 
-30L), class = c("tbl_df", "tbl", "data.frame"))

score 1 · Accepted Answer · answered Jan 11 '21 at 14:55

Here's a tidyverse version:

library(dplyr)

x %>%
  arrange(year) %>%
      # can add 'id' if desired, minimum 'year' required for below
  group_by(id) %>%
  filter(
    all(c("2010", "2019") %in% year),
    year %in% c("2010", "2019")
  ) %>%
  summarize_at(vars(-year), diff) %>%
  mutate(year = "changed") %>%
  ungroup() %>%
  bind_rows(x, .) %>%
  arrange(id, year)           # just to show id=492
# # A tibble: 39 x 18
#       id year  pland_00_water pland_01_evergr~ pland_02_evergr~ pland_03_decidu~ pland_04_decidu~ pland_05_mixed_~
#    <int> <chr>          <dbl>            <dbl>            <dbl>            <dbl>            <dbl>            <dbl>
#  1   267 2019          0.0833                0                0                0                0                0
#  2   268 2019          0.2                   0                0                0                0                0
#  3   362 2019          0.1                   0                0                0                0                0
#  4   420 2019          0.0556                0                0                0                0                0
#  5   421 2019          0.0667                0                0                0                0                0
#  6   484 2019          0.125                 0                0                0                0                0
#  7   492 2010          0.1                   0                0                0                0                0
#  8   492 2019          0.1                   0                0                0                0                0
#  9   492 chan~         0                     0                0                0                0                0
# 10   719 2010          0.0769                0                0                0                0                0
# # ... with 29 more rows, and 10 more variables: pland_06_closed_shrubland <dbl>, pland_07_open_shrubland <dbl>,
# #   pland_08_woody_savanna <dbl>, pland_09_savanna <dbl>, pland_10_grassland <dbl>, pland_11_wetland <dbl>,
# #   pland_12_cropland <dbl>, pland_13_urban <dbl>, pland_14_mosiac <dbl>, pland_15_barren <dbl>

Explanation:

the first arrange(year) is so that the diff later will have values in an expected order (assuming all years are year-like that sort lexicographically the same as a numerical sort);
the filter first removes any ids that do not have both years, and then ensures we have only those two years; while your data only contains "2010" and "2019", I didn't want to assume that ... it's a harmless filter if that's all you have, remove year %in% c("2010","2019") if desired and safe;
I assume that columns other than id and year are numeric/integer, so summarize_at(vars(-year), diff) is safe (id is out of the picture since it is a grouping variable); if there are non-numerical values, you might be able to use summarize_if(is.numeric, diff) which also works here ... but will silently NA-ize non-numeric fields if present;
bind_rows(x, .) is needed because the filter removed many rows we want/need to retain; and
the last arrange(id,year) is solely demonstrative for this answer.

When I get to ```bind_rows(x, .)``` it returns this error for me: ```Error: Can't combine `..1$year` and `..2$year` .``` — Lime, Jan 11 '21 at 15:12
That's because you changed your sample data from `"year"` as a string (which is what I used) to a numeric. You choose: if you want the numbers, then change the one line to `mutate(year = NA_real_)`. If you want `"changed"`, then don't try to use numbers. — r2evans, Jan 11 '21 at 15:20
Thank you, I just noticed this before your reply. I was only trying to provide an example by using the ```character: changed```, however, I should have thought it through more concisely. Nevertheless, you've given be a good foundation on how to progress this code further. Thank you. — Lime, Jan 11 '21 at 15:26
Glad it helps, either way works. I thought about commenting on the use of strings for `"year"` (since it is reflecting a continuous or ordinal variable), but thought your need for the literal `"changed"` supplanted that preference. I prefer `integer` to `character`, and `numeric` is close enough for most things. With that, I recommend going with `year=NA_real_` for the differencing. If you want the "changed" label somewhere, you can add another column that contains `"raw"` and `"diff"` labels ... or similar. — r2evans, Jan 11 '21 at 15:30

Minus values from columns relative to year

1 Answers1

Linked