Consider the following data / example. Each dataset contains a number of samples with one observation and one estimate:
library(tidyverse)
library(broom)
data = read.table(text = '
dataset sample_id observation estimate
A A1 4.8 4.7
A A2 4.3 4.5
A A3 3.1 2.9
A A4 2.1 2
A A5 1.1 1
B B1 4.5 4.3
B B2 3.9 4.1
B B3 2.9 3
B B4 1.8 2
B B5 1 1.2
', header = TRUE)
I want to calculate a linear model per dataset to remove any linear bias between observation and estimate, and get the fitted values next to the original ones:
data %>%
group_by(dataset) %>%
do(lm(observation ~ estimate, data = .) %>% augment)
What this does, however, is remove the sample_id
column, which I need to keep in order to continue calculations with this dataset based on that unique ID:
# A tibble: 10 x 10
# Groups: dataset [2]
dataset observation estimate .fitted .se.fit .resid .hat .sigma .cooksd .std.resid
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 4.80 4.70 4.68 0.107 0.115 0.478 0.152 0.491 1.04
2 A 4.30 4.50 4.49 0.0996 -0.193 0.416 0.0609 0.957 -1.64
3 A 3.10 2.90 2.97 0.0693 0.135 0.201 0.156 0.120 0.976
4 A 2.10 2.00 2.11 0.0849 -0.00583 0.303 0.189 0.000444 -0.0452
5 A 1.10 1.00 1.15 0.120 -0.0508 0.602 0.180 0.206 -0.521
6 B 4.50 4.30 4.31 0.109 0.191 0.468 0.0597 1.20 1.65
7 B 3.90 4.10 4.09 0.100 -0.193 0.396 0.0844 0.798 -1.56
8 B 2.90 3.00 2.91 0.0713 -0.00630 0.201 0.195 0.000247 -0.0443
9 B 1.80 2.00 1.83 0.0898 -0.0275 0.319 0.193 0.0103 -0.210
10 B 1.00 1.20 0.964 0.125 0.0355 0.616 0.191 0.104 0.360
How can I keep the additional column from my original dataset?
I have seen this answer which uses nest
to collapse the data before, but I still only get the model parameters using this approach. I guess I could extract the parameters per dataset:
data %>%
group_by(dataset) %>%
nest() %>%
mutate(
mod = map(data, linear_adj_model),
pars = map(mod, tidy)
) %>%
unnest(pars) %>%
select(dataset, term, estimate) %>%
spread(term, estimate)
… which gives me this:
# A tibble: 2 x 3
dataset `(Intercept)` estimate
* <fct> <dbl> <dbl>
1 A 0.196 0.955
2 B -0.330 1.08
… and then left-join that with the original data, and then mutate
each estimate
to get the linearly-adjusted one, but that seems much too complicated.
Another ugly hack I've found consists in adding the column as a dummy variable to the model:
data %>%
group_by(dataset) %>%
do(lm(observation ~ estimate + 0 * sample_id, data = .) %>% augment)
Is there an easier (tidy) solution that does not involve manually specifying the variables I want to keep?