1

Let's say I have the following dataset:

ID  Type  Group      Week    Value
111 A      Pepper     -1      10
112 B      Salt        2      20
113 C      Curry       4      40
114 D      Rosemary    9      90
211 A      Pepper     -1      15
212 B      Salt        2      30
214 D      Rosemary    9      135

Where ID, Type and Group as well as Week are entered in a measurement instrument measuring "value" each week. Sometimes there are multiple results per week so the initial tidying was to create a mean for each weekly measurement.

I would like to

a) create a dataset where the rows are automatically inserted where there are empty lines in the Week-column so it looks like this - always with the Type order A, B, C, D and Group order Pepper, Salt, Curry, Rosemary and Week -1, 2, 4, 9.

ID  Type  Group      Week    Value
111 A      Pepper     -1      10
112 B      Salt        2      20
113 C      Curry       4      40
114 D      Rosemary    9      90
211 A      Pepper     -1      15
212 B      Salt        2      30
213 C      Curry       4      60
214 D      Rosemary    9      135

b) The objective is to calculate the difference between the measured values in a vertical plane only for each group ie:

ID  Type  Group      Week    Value  Diff
111 A      Pepper     -1      10     NA
112 B      Salt        2      20     10
113 C      Curry       4      40     20 
114 D      Rosemary    9      90     50
211 A      Pepper     -1      15     NA
212 B      Salt        2      30     15
213 C      Curry       4      60     30
214 D      Rosemary    9      135    75

I can see how to do this in a for loop but there must be a more elegant way?

Emil Olsen
  • 35
  • 3
  • How is it known which weeks are missing. Are the weeks only available for -1, 2, 4, 9 in repeat? Should it carry over the previous `Type`, and `Group` from the previous available data for that week (and impute the missing `ID`)? – Ben Jan 10 '20 at 20:02
  • Thank you Ben yes post corrected. Weeks only available for -1, 2, 4, 9 in repeat and there are randomly missing lines. The Type and group will vary with more types and more groups in the real data-set. – Emil Olsen Jan 10 '20 at 21:41
  • Thanks - and how was it determined that `Curry` inserted would have value `60`? – Ben Jan 10 '20 at 21:49
  • Ah perfect Ben, thank you again and sorry for the lack of clarity. The values are measured every week and to make it better the original dataset sometimes has multiple repeats per week and sometimes not so I have summarised it to a mean per week to make it a little simpler to process. – Emil Olsen Jan 11 '20 at 11:10
  • Hi Emil - wish I could be of more help. If you don't get any other responses or comments, I might do a new post and provide the additional details for clarification. Also, take a look at this [description for making a minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Best of luck! – Ben Jan 11 '20 at 19:41
  • Thank you Ben this actually does the trick only for some reason the lag function doesn't give the NA on the first line in every group but otherwise I'll hit the solved. – Emil Olsen Jan 12 '20 at 16:38
  • The missing NA may have to do with grouping which in the example is every 4 rows — would you be able to share part of the data where this happens? – Ben Jan 12 '20 at 16:49

1 Answers1

0

I'm not sure this will be helpful, but thought it might be a start.

If you have repeating groups of rows, I might create a generic data frame and repeat it multiple times, then join with your available data set. This will effectively insert rows that are missing.

Also, if you use tidyverse you can calculate the diff by using lag.

Note this will not give the exact same result as I was not sure where 60 for Curry came from (will edit answer later).

library(tidyverse)

# Define number of repeating groups
N = 2

# Create generic group of Type, Group, Week
df <- data.frame(
  Type = c("A", "B", "C", "D"),
  Group = c("Pepper", "Salt", "Curry", "Rosemary"),
  Week = c(-1, 2, 4, 9)
)

# Represents the number of rows
nrow_df <- nrow(df)

# Repeat groups of rows N times
full_df <- df[rep(seq_len(nrow_df), times = N), ]

# Add ID numbers
full_df$ID <- rep(seq(110, (100 * N) + 10, by=100), each=nrow_df) + seq(1:nrow_df)

# Second data frame with missing rows
df2 <- read.table(text =
"ID  Type  Group      Week    Value
111 A      Pepper     -1      10
112 B      Salt        2      20
113 C      Curry       4      40
114 D      Rosemary    9      90
211 A      Pepper     -1      15
212 B      Salt        2      30
214 D      Rosemary    9      135", header = T, stringsAsFactors = T)

# Join the data frames and get differences
full_df %>%
  left_join(df2) %>%
  group_by(grp = ceiling(row_number()/nrow_df)) %>%
  mutate(Diff = Value - lag(Value))

# A tibble: 8 x 7
# Groups:   grp [2]
  Type  Group     Week    ID Value   grp  Diff
  <fct> <fct>    <dbl> <dbl> <int> <dbl> <int>
1 A     Pepper      -1   111    10     1    NA
2 B     Salt         2   112    20     1    10
3 C     Curry        4   113    40     1    20
4 D     Rosemary     9   114    90     1    50
5 A     Pepper      -1   211    15     2    NA
6 B     Salt         2   212    30     2    15
7 C     Curry        4   213    NA     2    NA
8 D     Rosemary     9   214   135     2    NA
Ben
  • 28,684
  • 5
  • 23
  • 45