0

I have a dataset that contains data in the following manner:

Name, Result, Date #A lot of other stuff as we, but these are the only relevant ones I think
Peter Parker, 150, 2018-03-03
Peter Parker, 155, 2018-03-04
Peter Parker, 156, 2018-03-05
Peter Parker, 154, 2018-03-06
Peter Parker, 158, 2018-03-07
Benny Thompson, 130, 2018-03-03
Benny Thompson, 132, 2018-03-04
Benny Thompson, 138, 2018-03-05
Benny Thompson, 140, 2018-03-07
Benny Thompson, 139, 2018-03-09
Mylo Thony, 177, 2018-03-11

some of the persons are present at least, lets say 5 times. I want to create a model to predict the 5th result with regression if I give it the first four. I therefore assume that all Persons are "behaving" equally and want to transform my data into this format:

150, 155, 156, 154, 158
130, 132, 138, 140, 139

and that only for all persons who at least have 5 results in the record. I have no idea where to start, I come from the Java and C++ department, usually I would just run a for loop over this, but this seems very un-R-like to me, at least I have never seen anything like this yet. What is the best approach to something like this?

monamona
  • 1,195
  • 2
  • 17
  • 43
  • Is the Date sequence identical for all Names? – Jon Spring Jan 05 '23 at 19:35
  • No, it can vary greatly from the 1960's until 2019 – monamona Jan 05 '23 at 19:36
  • I see a few prior answers that might help here. Here's one using a kalman filter: https://stackoverflow.com/questions/60574665/impute-missing-with-interpolation-by-groups – Jon Spring Jan 05 '23 at 19:45
  • This seems to indeed help once you get the data to the format where you have the results in one row, but I am mostly having trouble to getting it there. I will however take a look at the kalman filter! Thanks for the tip. – monamona Jan 05 '23 at 19:53

2 Answers2

2

As far as I understood, the following steps have to be done:

  • remove rows from participants with less than 5 observations
  • transfer data to wide format

Removing rows

By using the which function from base R you can remove the participants with less than 5 observations.

x <- data.frame(Name=c("Peter Parker", "Peter Parker", "Peter Parker", "Peter Parker","Peter Parker", "Benny Thompson", "Benny Thompson", "Benny Thompson", "Benny Thompson", "Benny Thompson", "Mylo Thony"),
           Result=c(150,155, 156, 154, 158,130, 132, 138, 140, 139, 177))
x <- x[with(x, Name %in% names(which(table(Name)>=5))), ]

Transfer to wide format

Your data is currently in long format. It can be easily transfered to wide format using the spread function from the tidyr package. The spread function requires a key column which contains the new column names. If there are only 5 observations per person, you can just add a new column repeating the numbers 1 to 5 as often as there are unique values in your Name column (calculated with n_distinct from the dplyr package). Note that your data should be ordered by both Name and Date, so that the numbers are assigned to the correct observation.

 x$Measurement <- rep(c(1,2,3,4,5), dplyr::n_distinct(x$Name))
 x_wide <- tidyr::spread(x, Measurement, Result)

output

If you have more than 5 observations per person, the solution has to be modified.

  • 1
    Note that tidyr::spread was superseed by "pivot_wider". – Ric Jan 05 '23 at 20:27
  • This is also exactly what I needed, thanks for the quick reply, well deserved upvote. The other answer gave me some more tips on the evaluation, which might be helpful to future users, so I will accept it and "just" leave you with an upvote. – monamona Jan 06 '23 at 09:35
1

Perhaps "row" format is not the best way to perform predictions on your data. I'll give an example with dplyr/tidyr for the data in one row case:

The data:

data <- data.frame(Name = c("Peter Parker", "Peter Parker", "Peter Parker", 
"Peter Parker", "Peter Parker", "Benny Thompson", "Benny Thompson", 
"Benny Thompson", "Benny Thompson", "Benny Thompson", "Mylo Thony"
), Result = c(150L, 155L, 156L, 154L, 158L, 130L, 132L, 138L, 
140L, 139L, 177L), Date = c(" 2018-03-03", " 2018-03-04", " 2018-03-05", 
" 2018-03-06", " 2018-03-07", " 2018-03-03", " 2018-03-04", " 2018-03-05", 
" 2018-03-07", " 2018-03-09", " 2018-03-11"))

data
#>              Name Result        Date
#> 1    Peter Parker    150  2018-03-03
#> 2    Peter Parker    155  2018-03-04
#> 3    Peter Parker    156  2018-03-05
#> 4    Peter Parker    154  2018-03-06
#> 5    Peter Parker    158  2018-03-07
#> 6  Benny Thompson    130  2018-03-03
#> 7  Benny Thompson    132  2018-03-04
#> 8  Benny Thompson    138  2018-03-05
#> 9  Benny Thompson    140  2018-03-07
#> 10 Benny Thompson    139  2018-03-09
#> 11     Mylo Thony    177  2018-03-11

To filter and transforming the data

library(dplyr)
library(tidyr)

data_in_one_row <- data |> group_by(Name) |> 
  mutate(count = n(), id = 1:n()) |>
  filter(count == 5)  |>   
  pivot_wider(id_cols = c(Name), names_from = id, 
              values_from = -c(Name, id))

data_in_one_row

#> # A tibble: 2 × 16
#> # Groups:   Name [2]
#>   Name       Resul…¹ Resul…² Resul…³ Resul…⁴ Resul…⁵ Date_1 Date_2 Date_3 Date_4
#>   <chr>        <int>   <int>   <int>   <int>   <int> <chr>  <chr>  <chr>  <chr> 
#> 1 Peter Par…     150     155     156     154     158 " 201… " 201… " 201… " 201…
#> 2 Benny Tho…     130     132     138     140     139 " 201… " 201… " 201… " 201…
#> # … with 6 more variables: Date_5 <chr>, count_1 <int>, count_2 <int>,
#> #   count_3 <int>, count_4 <int>, count_5 <int>, and abbreviated variable names
#> #   ¹​Result_1, ²​Result_2, ³​Result_3, ⁴​Result_4, ⁵​Result_5

Eventually you can apply a single linear mode prediction, depending on your expectations (on the un transformed dataset).

data <- data |> group_by(Name) |> 
  mutate(count = n(), id = 1:n()) |>
  filter(count == 5) 

model <- lm(Result ~ Name + as.POSIXct(Date), data)

pred_dataset <- data |> group_by(Name) |> 
  summarise(Date = max(as.POSIXct(Date)) + 24 * 3600)

pred_dataset$prediction = predict(model, pred_dataset)

pred_dataset
#> # A tibble: 3 × 3
#>   Name           Date                prediction
#>   <chr>          <dttm>                   <dbl>
#> 1 Benny Thompson 2018-03-10 00:00:00       143.
#> 2 Mylo Thony     2018-03-12 00:00:00       179.
#> 3 Peter Parker   2018-03-08 00:00:00       159.

Ric
  • 5,362
  • 1
  • 10
  • 23