extracting data into a series of new dataframes based on an identifier in R

Question

I am trying to do something that I am sure is easy, but I don't even know what to search for to find the solution.

I have a long time series of data. One column indicates state, which has a value of 0, 1, 2 or -1. I want to extract and work on the data where state = 2. Extracting all of this data at once is no problem; what I want to do is treat each period where state = 2 individually. That is, I want to go through the data and, where state = 2, do some analysis (e.g. a linear fit to the initial period on other variables associated with this time period), then move on to the next instance where state = 2 and repeat the analysis. Alternatively, I could extract each period of data into its own dataframe and do the analysis there (but this will create hundreds of small dataframes).

I can't even work out how to identify the start points (i.e. where the state at row i is 2 and the state at row i-1 is 1).

Any pointers to the commands or packages that I should be looking at would be greatly appreciated.

Welcome to StackOverflow! Please read the info about [how to ask a good question](http://stackoverflow.com/help/how-to-ask) and how to give a [reproducible example](http://stackoverflow.com/questions/5963269). This will make it much easier for others to help you. — Sotos, Sep 04 '18 at 12:03
You might want to look into the `group_by` function in the `dplyr` package. — Kerry Jackson, Sep 04 '18 at 12:06
Dear biosferik, it would help if you posted a sample of the initial data and its structure, how you are handling it now, and what you expect to get as output. Without that it is quite difficult to grasp the problem. — Artem, Sep 09 '18 at 09:20

Artem · Answer 1 · 2018-09-15T20:42:57.783

You can use the `apply` function with `MARGIN = 1` (row-wise). Because the function passed to `apply` returns a list here, the value of `apply` is itself a list, so you can access each result by index. Please see below:

# Simulation of the data ----
library(lubridate)

set.seed(123)
n <- 20
df <- data.frame(
  id = factor(sample(-1:2, n, replace = TRUE)),
  t_stamp = sort(sample(seq(dmy_hm("01-01-2018 00:00"), dmy_hm("01-01-2019 00:00"), by = 10), n)),
  matrix(rnorm(100 * n), nrow = n, dimnames = list(NULL, c(paste0("X", 1:50), paste0("Y", 1:50)))))

# Calculation & output -----
# keep only the rows where id == 2
df_m <- df[df$id == 2, ]

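# fit a linear regression to each row (incident) ----
df_res <- apply(df_m, 1, function(xs) {
  # apply() passes each row as a character vector; drop id and t_stamp
  rw <- as.numeric(xs[-(1:2)])
  x <- rw[1:50]    # columns X1..X50
  y <- rw[51:100]  # columns Y1..Y50
  smry <- summary(lm(y ~ x))
  # return the timestamp together with the regression summary
  list(xs[2], smry)
})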

# access the result for the second incident
df_res[[2]]
# [[1]]
# t_stamp 
# "2018-03-26 13:01:40" 
# 
# [[2]]
# 
# Call:
#  lm(formula = y ~ x)
#
# Residuals:
#   Min       1Q   Median       3Q      Max 
# -2.44448 -0.79877  0.09427  0.88524  3.11190 
# 
# Coefficients:
#   Estimate Std. Error t value Pr(>|t|)
# (Intercept)   0.1731     0.1592   1.088    0.282
# x             0.1595     0.1544   1.033    0.307
# 
# Residual standard error: 1.125 on 48 degrees of freedom
# Multiple R-squared:  0.02173, Adjusted R-squared:  0.001349 
# F-statistic: 1.066 on 1 and 48 DF,  p-value: 0.307

# check the second incident
df_m[2, ]
# id             t_stamp         X1       X2        X3        X4   ...        X50        Y1         Y2       Y3        Y4
# 4  2 2018-03-26 13:01:40 -0.7288912 2.168956 -1.018575 0.6443765 ... -0.1321751 0.8983962 -0.2608322 1.036548 -1.691862

Hi, thank you for your response, that is helpful but not exactly what I need. The thing is that I have many (hundreds of) groups of data that I want to extract and analyse separately, i.e. as the system cycles through, it generates a set of data every 10 minutes, indicated by the flag '2', and I want to analyse each of these separately, not combine all the incidences where state = 2. At the moment I am doing this based on the timestamp, but this is not fail-proof, so I wanted to link it to the state variable. Essentially, I want to create a series of dataframes from one initial one. — biosferik, Sep 06 '18 at 15:07
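
For the case described in the question and in the comment above, where each contiguous period with state = 2 should be handled on its own, one possible approach is to label the runs with base R's `rle()` and then `split()` the rows by run. The following is only a minimal sketch under stated assumptions: the toy data and the column names `time`, `state` and `value` are made up for illustration and do not come from the original post, and a "start" is taken to be any row where state is 2 and the previous row's state is not 2.

```r
# Toy data; the column names time, state and value are assumptions
df <- data.frame(
  time  = 1:12,
  state = c(0, 1, 2, 2, 2, 0, 1, 2, 2, -1, 1, 2),
  value = c(5, 6, 1, 2, 3, 9, 8, 2, 4, 7, 6, 10)
)

# Start points: state is 2 on this row and was not 2 on the previous row
is2    <- df$state == 2
starts <- which(is2 & !c(FALSE, head(is2, -1)))

# Give every contiguous run of identical states its own run id
r      <- rle(df$state)
run_id <- rep(seq_along(r$lengths), r$lengths)

# Keep the state == 2 rows and split them into one data frame per run
periods <- split(df[is2, ], run_id[is2])

# Run the same analysis on each period, here a linear fit of value on time;
# runs with fewer than two rows are skipped and yield NULL
fits <- lapply(periods, function(p) {
  if (nrow(p) >= 2) coef(lm(value ~ time, data = p)) else NULL
})
```

Here `periods` is a named list with one small data frame per period, so there is no need to create hundreds of separately named objects; `periods[[1]]` is the first period and `fits[[1]]` holds its coefficients. If only the first few rows of each period are of interest, something like `head(p, k)` inside the function (with a hypothetical `k`) would restrict the fit to the initial part.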
