0

I run into this scenario often:

I have a designed experiment where each experimental unit is assigned a treatment value, and I have a data frame depicting the experimental design (df.design). The final data file (df.data) may have different numbers of measurements, and only the ID of the experimental unit, not the treatment variables.

My goal is to be able to create a treatment variable in df.data, based on df.design

in other words df.data$treatment should equal df.design$treatment anywhere df.data$ID==df.design$ID even though length(df.data$ID) != length(df.design$ID)

Because this is a common scenario with very different data sets, I'm hoping for a universal or at least easily modify among scenarios.

I have tried solutions along the lines of:

df.data$treatment <- case_when(
   df.data$ID[i] == df.design$ID[i] ~ df.design$treatment[i])

But this returns <NA>s

example df.desgin

> head(df.design)
  ID                    treatment
1  1   No heat - Keep all foliage
2  2    Heat - Remove new foliage
3  3 No heat - Remove old foliage
4  4 No heat - Remove new foliage
5  5    Heat - Remove old foliage
6  6      Heat - Keep all foliage

Example df.data:

> head(df.data)
  obs ID subsample         A
1   1  1       New       1.3
2  13  2       New       3.3
3  12  2    Mature       1.1
4  14  3    Mature       3.8
5  15  4    Mature       3.4
6  16  5    Mature       2.0

You'll see that some, but not all IDs have multiple measurements ("A"). These measurements come from an instrument that produces the data with labels "obs", "ID", and "subsample".

  • Can you add example data for `df.data` and `df.design` and what you expect your final output to be? I'm currently not following what you want. – Amar Apr 07 '18 at 04:23
  • I've revised the post to include headers of the particular dataframes I'm using. Hopefully they are sufficiently informative. Thanks for considering. – Christopher Vincent Apr 07 '18 at 04:40
  • Still lot of information is missing. Where are the values in column 'A' coming from? Also, only 1 value of matching ID is given (ID =1). You need to provide a more clear example of what is the expected output. Does not have to be with actual data but it could be with dummy data that clearly defines your problem. Also, look into `match` function in R. – tushaR Apr 07 '18 at 04:57
  • @ChristopherVincent please have a look at how to make [reproducible examples in R](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – kangaroo_cliff Apr 07 '18 at 05:05
  • 1
    This operations is called *join* or *merge*. Please, try `merge(df.data, df.design, by = "ID")` – Uwe Apr 07 '18 at 05:37
  • Although the context is similar, the result of a merge, actually results in either missing rows from the longer df or in blank values in the ID variable that the longer data frame repeats. This results in the need for other more specific fixes, which I want to avoid. – Christopher Vincent Apr 08 '18 at 02:13

1 Answers1

1

I've created some data and then came up with a solution.

set.seed(1)
df_design <- data.frame(ID = 1:10, treatment = runif(10))
df_data <- data.frame(ID = sample(1:10, 4))

df_design
# ID  treatment
# 1   1 0.26550866
# 2   2 0.37212390
# 3   3 0.57285336
# 4   4 0.90820779
# 5   5 0.20168193
# 6   6 0.89838968
# 7   7 0.94467527
# 8   8 0.66079779
# 9   9 0.62911404
# 10 10 0.06178627

df_data
# 
# ID
# 1  3
# 2  2
# 3  6
# 4 10

df_data$treatment <- sapply(1:nrow(df_data), 
                      function(x) df_design[df_design$ID %in% df_data$ID[x], "treatment"])

df_data
# ID  treatment
# 1  3 0.57285336
# 2  2 0.37212390
# 3  6 0.89838968
# 4 10 0.06178627
kangaroo_cliff
  • 6,067
  • 3
  • 29
  • 42