Arithmetic between dataframes with varying numbers of rows in R

Question

I have objects containing monthly data on plant growth. Each object is a fixed number of columns, and the number of rows is equal to the number of months the plant survives. I would like to take the mean of these objects so that the mean considers only plants surviving at a given timestep. Here is example data:

df1 <- data.frame(GPP = 1:10, NPP = 1:10)
df2 <- data.frame(GPP = 2:8, NPP = 2:8)
df3 <- data.frame(GPP = 3:9, NPP = 3:9 )

In this scenario, the maximum timesteps is 10, and the 2nd and 3rd plants did not survive this long. To take the mean, my initial thought was to replace empty space with NA to make the dimensions the same, such as this:

na <- matrix( , nrow = 3, ncol = 2)
colnames(na) <- c("GPP","NPP")
df2 <- rbind(df2, na)
df3 <- rbind(df3, na)

This is not desirable because the NA does not simply ignore the value as I had hoped, but nullifies the field, leading to all outputs of arithmetic with NA becoming NA, such as this:

(df1 + df2 + df3) / 3

I can NOT just fill na with 0s because I want to see the mean of every plant that is living at a given timestep while completely ignoring those that have died. Replacing with 0s would skew the mean, and not achieve this. For my example data here, this is the desired outcome

(df1 + df2 + df3) / 3

Here rows 8-10 are replaced with the values from df1 because there are only 7 rows in both df2 and df3.

Strong suggestion: if your data is monthly, then put the month in the data. While you may "know" that each frame starts at month 1 (or 0), it is far better programmatically to remove that silent assumption and have it explicitly in the data. That will enable many other processes and methods that are much less fragile to what you're attempting here, and will prevent big "oopses" later in your data pipeline. — r2evans, May 25 '21 at 17:06
@Limey, typo: it's typically `na.rm=`, not `rm.na=` ... though I'm not certain which function *used in code here* would be using that argument. (The OP does mention calculating the mean, but not using `mean` ...) — r2evans, May 25 '21 at 17:08
@r2evans I am working in model development. I could add months as placeholders, but they would be meaningless, as the model is simulating on a monthly basis, but entirely hypothetically. The rows of each df do not correspond to actual dates. — sethparker, May 25 '21 at 17:12
I didn't say to encode actual dates, I said to encode the month number that you are assuming with each row. Take it or leave it, but I've lost a lot of time to debugging *inferred timelines* when the order of data was inadvertently (silently) changed mid-processing. (I toiled over a month on my graduate thesis data-wrangling by making this same exact mistake.) Over to you, I feel not being explicit with "time" makes your data fragile. — r2evans, May 25 '21 at 17:16
@r2evans i had issues using ```mean``` in my current data format, so had been adding together and dividing by a value ```nb.iter```, here 3, to achieve a mean. — sethparker, May 25 '21 at 17:16
@r2evans Both points well taken. I am forever getting the `rm` and the `na` the wrong way round. Also true regarding failure to mention the mean function. But I feel it's a reasonable assumption. BTW: I always value your contributions. — Limey, May 25 '21 at 17:17
@r2evans I understand what you're saying now. Looking at the solution proposed by Zulkifli, I will likely take your advice and write a loop to add the month number to each dataframe. My thesis has enough toil without adding more. Thanks! — sethparker, May 25 '21 at 17:19
"I had issues using mean in my current data format". Thereby hangs a tale. I've learnt that if I find myself thinking along those lines, then the root of the problem is likely to be my data format, and I should take a step back. @Zulkifli seems to have given you a reasonable solution. — Limey, May 25 '21 at 17:20

score 2 · Accepted Answer · answered May 25 '21 at 17:22

I'll restate my comment: it is generally much safer to encode the month in the original data before you do anything else; it is explicit and will insulate you from mistakes later in the pipeline that might inadvertently change the order of rows (which completely breaks any valid significance you hope to attain). Additionally, since I'm going to recommend putting all data into one frame, let's encode the plant number as well (even if we don't use it immediately here).

For that, then:

df1 <- data.frame(plant = "A", month = 1:10, GPP = 1:10, NPP = 1:10)
df2 <- data.frame(plant = "B", month = 1:7, GPP = 2:8, NPP = 2:8)
df3 <- data.frame(plant = "C", month = 1:7, GPP = 3:9, NPP = 3:9)

From this, I'm a huge fan of having all data in one frame. This is well-informed by https://stackoverflow.com/a/24376207/3358227, where a premise is that if you're going to do the same thing to a bunch of frames, it should either be a list-of-frames or one combined frame (that keeps the source id encoded):

dfs <- do.call(rbind, list(df1, df2, df3))
### just a sampling, for depiction
dfs[c(1:2, 10:12, 17:19),]
#    plant month GPP NPP
# 1      A     1   1   1
# 2      A     2   2   2
# 10     A    10  10  10
# 11     B     1   2   2
# 12     B     2   3   3
# 17     B     7   8   8
# 18     C     1   3   3
# 19     C     2   4   4

base R

aggregate(cbind(GPP, NPP) ~ month, data = dfs, FUN = mean, na.rm = TRUE)
#    month GPP NPP
# 1      1   2   2
# 2      2   3   3
# 3      3   4   4
# 4      4   5   5
# 5      5   6   6
# 6      6   7   7
# 7      7   8   8
# 8      8   8   8
# 9      9   9   9
# 10    10  10  10

dplyr

library(dplyr)
dfs %>%
  group_by(month) %>%
  summarize(across(c(GPP, NPP), mean))
# # A tibble: 10 x 3
#    month   GPP   NPP
#    <int> <dbl> <dbl>
#  1     1     2     2
#  2     2     3     3
#  3     3     4     4
#  4     4     5     5
#  5     5     6     6
#  6     6     7     7
#  7     7     8     8
#  8     8     8     8
#  9     9     9     9
# 10    10    10    10

Side point: two bits of data you are "losing" in this summary is the size of data and the variability of each month. You might include them with:

dfs %>%
  group_by(month) %>%
  summarize(across(c(GPP, NPP), list(mu = ~ mean(.), sigma = ~ sd(.), len = ~ length(.))))
# # A tibble: 10 x 7
#    month GPP_mu GPP_sigma GPP_len NPP_mu NPP_sigma NPP_len
#    <int>  <dbl>     <dbl>   <int>  <dbl>     <dbl>   <int>
#  1     1      2         1       3      2         1       3
#  2     2      3         1       3      3         1       3
#  3     3      4         1       3      4         1       3
#  4     4      5         1       3      5         1       3
#  5     5      6         1       3      6         1       3
#  6     6      7         1       3      7         1       3
#  7     7      8         1       3      8         1       3
#  8     8      8        NA       1      8        NA       1
#  9     9      9        NA       1      9        NA       1
# 10    10     10        NA       1     10        NA       1

In this case, an average of 8 may be meaningful, but noting that it is a length of 1 is also informative of the "strength" of that statistic (i.e., weak).

score 1 · Answer 2 · edited May 25 '21 at 17:16

1

library(dplyr)

df1 <- data.frame(month = 1:10, GPP = 1:10, NPP = 1:10)
df2 <- data.frame(month = 1:7, GPP = 2:8, NPP = 2:8)
df3 <- data.frame(month = 1:7, GPP = 3:9, NPP = 3:9 )

df <- rbind(df1, df2, df3)

df %>%
  group_by(month) %>%
  summarise(GPP = mean(GPP),
            NPP = mean(NPP))

   month   GPP   NPP
   <int> <dbl> <dbl>
 1     1     2     2
 2     2     3     3
 3     3     4     4
 4     4     5     5
 5     5     6     6
 6     6     7     7
 7     7     8     8
 8     8     8     8
 9     9     9     9
10    10    10    10

edited May 25 '21 at 17:16

henryn

1,163
4
15

answered May 25 '21 at 17:11

Zulkifli

94
3

did you do this with dplyr? – sethparker May 25 '21 at 17:14
oh, my bad. yes group_by and summarise are the functions in dplyr – Zulkifli May 25 '21 at 17:15

score 0 · Answer 3 · answered May 25 '21 at 17:26

0

Using data.table

library(data.table)
rbindlist(mget(ls(pattern = '^df\\d+$')))[, lapply(.SD, mean), month]

answered May 25 '21 at 17:26

akrun

874,273
37
540
662

Arithmetic between dataframes with varying numbers of rows in R

3 Answers3

base R

dplyr