Say I have variables measured at multiple time points and I want to apply the same operation at every time point. How can I do this more efficiently than repeating the code for each individual time point? In the examples below I want to 1) get the sum of selected columns at each time point, and 2) for each variable, see how much it changes from baseline at each later time point.

#fake data to show what the dataset I receive looks like:
library(reshape2)
id=rep(c(1,1,1,2,2,2,3,3,3), 3)            
time=c(rep("Time1",9), rep("Time2",9), rep("Time3",9))
test=rep(c("calcium","magnesium","zinc"), 9) 
score=rnorm(n = 27, mean = 10, sd = 3)
fake <- data.frame(id, time, test, score)
fake <- dcast(fake, id ~ time + test, value.var = "score")

#Task 1- Get total of selected columns at each time point
#Non-efficient method:
fake$totalmgca1 <- rowSums(fake[,c("Time1_calcium", "Time1_magnesium")])
fake$totalmgca2 <- rowSums(fake[,c("Time2_calcium", "Time2_magnesium")])
fake$totalmgca3 <- rowSums(fake[,c("Time3_calcium", "Time3_magnesium")])


#Task 2 - Get change in calcium levels from baseline to each later time point
#Non-efficient method:
fake$calciumt1t2 <- fake$Time2_calcium - fake$Time1_calcium
fake$calciumt1t3 <- fake$Time3_calcium - fake$Time1_calcium

Any tips for how I can do the above in fewer lines? Is there a way to use group_by() for this, or do I need to make lists and use lapply()?

CineyEveryday

2 Answers

For me, a good start would be keeping the original data in long/tidy format, something like:

library(tidyverse)

id <- c(rep(1,3), rep(2,3), rep(3,3))
set.seed(1) # for reproducible sample values
value <- rnorm(9)
param <- c(rep("calcium", 3), rep("magnesium", 3), rep("zinc", 3))
time  <- rep(c(1,2,3), 3)
df <- data.frame(id, value, param, time)
as_tibble(df) #convenient way to see the data
# A tibble: 9 x 4
#     id   value param      time
#  <dbl>   <dbl> <fct>     <dbl>
#1     1  -0.626 calcium       1
#2     1   0.184 calcium       2
#3     1  -0.836 calcium       3
#4     2   1.60  magnesium     1
#5     2   0.330 magnesium     2
#6     2  -0.820 magnesium     3
#7     3   0.487 zinc          1
#8     3   0.738 zinc          2
#9     3   0.576 zinc          3

Then, if you're looking for fewer lines, you could define a function in another file (say, function_defs.r), something like difference_from_baseline(), and load it with source(). In your main working file the whole operation then takes a single line, such as operated_on_desired_data <- difference_from_baseline(df), once you find the right existing functions for your math.
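
As a rough sketch of what such a helper might look like: difference_from_baseline() is only the hypothetical name suggested above, not an existing function, and the column names id, param, time and value are assumed to match the example data frame. The baseline is taken to be the earliest time point within each id/param group.

```r
library(dplyr)

# Hypothetical helper: subtracts each group's baseline (earliest time) value.
# Assumes columns named id, param, time and value, as in the example above.
difference_from_baseline <- function(data) {
  data %>%
    group_by(id, param) %>%
    arrange(time, .by_group = TRUE) %>%
    mutate(change = value - first(value)) %>%
    ungroup()
}

# Same shape of example data as above
set.seed(1)
df <- data.frame(
  id    = rep(1:3, each = 3),
  value = rnorm(9),
  param = rep(c("calcium", "magnesium", "zinc"), each = 3),
  time  = rep(1:3, times = 3)
)

operated_on_desired_data <- difference_from_baseline(df)
```

The new change column is 0 at each group's first time point and holds the difference from that first value everywhere else.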

dbo
  • Sorry, I should have specified in the original post: the data generation code is to show an example of what my data looks like. My actual data is not created by me, i.e. there is no "original" data that I can keep in any particular form. I am just showing how to create a dataset that looks like what I am working with (which is in wide format) – CineyEveryday Jun 19 '19 at 00:43
  • I see - though I'd still probably convert the format to long, as many libraries like the tidyverse are built around it - check `gather()` – dbo Jun 19 '19 at 00:50

You might first consider leaving your data in long format; that is, stop at:

fake <- data.frame(id, time, test, score)

and don't dcast.
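
If the data arrives already in wide format, as the asker describes, one option is to reshape it back to long first. The following is a sketch that assumes tidyr 1.0.0 or later and column names of the form Time1_calcium; the wide data frame is made up here purely for illustration.

```r
library(tidyr)

# Made-up wide data with columns named <time>_<test>, like the question's
set.seed(123)
col_names <- paste(
  rep(paste0("Time", 1:3), each = 3),
  rep(c("calcium", "magnesium", "zinc"), times = 3),
  sep = "_"
)
wide <- data.frame(
  id = 1:3,
  matrix(rnorm(27, mean = 10, sd = 3), nrow = 3, dimnames = list(NULL, col_names))
)

# Split each column name at the underscore into a time and a test column
long <- pivot_longer(
  wide,
  cols = -id,
  names_to = c("time", "test"),
  names_sep = "_",
  values_to = "score"
)
```

The resulting long data frame has one row per id, time and test, which is the layout the dplyr code in this answer expects.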

Now you can use dplyr functions.

library(dplyr)

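For example, to add a column for the change in each test's score from the previous time point (replace lag(score) with first(score) to measure the change from the Time1 baseline instead):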

fake %>% 
  arrange(time) %>% 
  group_by(id, test) %>% 
  mutate(test_diff = score - lag(score))
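
Note that lag() compares each score with the previous time point, so at Time3 it gives Time3 minus Time2. If the change from baseline (Time1) is what is needed, a variant using first() might look like the sketch below. The column name baseline_diff is an assumption for illustration, and the long-format data is rebuilt so the block runs on its own.

```r
library(dplyr)

# Long-format data shaped like the question's `fake` before dcast()
set.seed(1)
fake <- data.frame(
  id    = rep(rep(1:3, each = 3), times = 3),
  time  = rep(c("Time1", "Time2", "Time3"), each = 9),
  test  = rep(c("calcium", "magnesium", "zinc"), times = 9),
  score = rnorm(27, mean = 10, sd = 3)
)

baseline_change <- fake %>%
  arrange(time) %>%
  group_by(id, test) %>%
  # After arrange(), first(score) is the Time1 value in each id/test group
  mutate(baseline_diff = score - first(score)) %>%
  ungroup()
```

Here baseline_diff is 0 on every Time1 row rather than NA, because each baseline value is compared with itself.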

To add a column for the calcium + magnesium sum at each time:

fake %>% 
  group_by(id, time) %>% 
  filter(test != "zinc") %>% 
  summarise(total_mgca = sum(score)) %>% 
  right_join(fake, by = c("id", "time"))
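
A possible alternative that avoids the join is to compute the sum inside mutate(), so every row keeps its own score alongside the group total. This is only a sketch; the data is rebuilt here so the block is self-contained, and with_total is a name chosen for illustration.

```r
library(dplyr)

# Long-format data shaped like the question's `fake` before dcast()
set.seed(1)
fake <- data.frame(
  id    = rep(rep(1:3, each = 3), times = 3),
  time  = rep(c("Time1", "Time2", "Time3"), each = 9),
  test  = rep(c("calcium", "magnesium", "zinc"), times = 9),
  score = rnorm(27, mean = 10, sd = 3)
)

with_total <- fake %>%
  group_by(id, time) %>%
  # Sum only the calcium and magnesium scores within each id/time group
  mutate(total_mgca = sum(score[test %in% c("calcium", "magnesium")])) %>%
  ungroup()
```

Because mutate() keeps all rows, the zinc rows are still present and carry the same total_mgca as the other rows for that id and time.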

Both together:

fake %>% 
  group_by(id, time) %>% 
  filter(test != "zinc") %>% 
  summarise(total_mgca = sum(score)) %>% 
  ungroup() %>% 
  right_join(fake, by = c("id", "time")) %>% 
  arrange(time) %>% 
  group_by(id, test) %>% 
  mutate(test_diff = score - lag(score)) %>%
  ungroup()

Result:

   id  time total_mgca      test     score   test_diff
1   1 Time1   21.64788   calcium 12.296461          NA
2   1 Time1   21.64788 magnesium  9.351419          NA
3   1 Time1   21.64788      zinc  6.897300          NA
4   2 Time1   25.16516   calcium 11.026712          NA
5   2 Time1   25.16516 magnesium 14.138449          NA
6   2 Time1   25.16516      zinc  4.462579          NA
7   3 Time1   15.39817   calcium  5.778935          NA
8   3 Time1   15.39817 magnesium  9.619240          NA
9   3 Time1   15.39817      zinc  4.976049          NA
10  1 Time2   29.97949   calcium 11.152820  -1.1436409
11  1 Time2   29.97949 magnesium 18.826667   9.4752480
12  1 Time2   29.97949      zinc  8.280754   1.3834534
13  2 Time2   32.65905   calcium 16.469051   5.4423387
14  2 Time2   32.65905 magnesium 16.190000   2.0515508
15  2 Time2   32.65905      zinc 10.781192   6.3186129
16  3 Time2   14.24311   calcium  3.843355  -1.9355800
17  3 Time2   14.24311 magnesium 10.399755   0.7805155
18  3 Time2   14.24311      zinc  7.868311   2.8922628
19  1 Time3   23.26662   calcium  9.325816  -1.8270041
20  1 Time3   23.26662 magnesium 13.940803  -4.8858643
21  1 Time3   23.26662      zinc 13.984667   5.7039133
22  2 Time3   16.67828   calcium  5.142377 -11.3266742
23  2 Time3   16.67828 magnesium 11.535903  -4.6540968
24  2 Time3   16.67828      zinc 13.057014   2.2758226
25  3 Time3   25.09958   calcium 14.158592  10.3152371
26  3 Time3   25.09958 magnesium 10.940988   0.5412329
27  3 Time3   25.09958      zinc 11.229914   3.3616030
neilfws
  • The code was just to generate fake data that looks like the data I have. The actual data I'm analyzing is not something I create. – CineyEveryday Jun 19 '19 at 18:19
  • Shouldn’t matter provided your real data resembles the example. If not then you should post an example of the real data that you are working with. – neilfws Jun 19 '19 at 20:51
  • I think you're confused. I can't "stop at" any particular point because the data is sent to me. I do not create it. I posted an example of something similar to the real data I had, and to achieve that I used code to create the fake data. That is the standard way of showing data on this website. – CineyEveryday Jun 19 '19 at 21:06
  • There are several ways to post data. A good method is dput(). See [this guide](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – neilfws Jun 19 '19 at 21:11