
I have a question about selecting data chunks based on a condition I provide. It is a multi-step process that I think should be done in a function, so it can be applied to other data sets with lapply.

  1. I have a data.frame with 19 columns (the example data here has only two). I first want to check the first column (time): its rows should be in the range 90 to 54000; skip any that are not. Then count those chunks, and count how many of the mag columns show all-positive values versus mixed negative/positive values. If a chunk contains a negative number, count it as a switched state, and report the switching rate as (number of chunks showing a switched state) / (number of chunks whose time values fall within 90:54000).

  2. For the data chunks that satisfy the range 90:54000, find the first observation in mag with a value < 0, together with its corresponding time.


numbers <- c(seq(1, -1, length.out = 601), seq(1, 0.98, length.out = 601))
time <- c(seq(90, 54144, length.out = 601), seq(90, 49850, length.out = 601))
data <- data.frame(time = rep(time, times = 12), mag = rep(numbers, times = 6))
n <- 90:54000
# my attempt: split into chunks, then take the first negative mag per chunk
dfchunk <- split(data, factor(sort(rank(row.names(data)) %% n)))
ext_fsw <- lapply(dfchunk, function(x) x[which(x$mag < 0)[1], ])
x.n <- data.frame(matrix(unlist(ext_fsw), nrow = n, byrow = TRUE))
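For reference, the chunking step can also be sketched in base R. This is only a sketch under an assumption that is not stated explicitly above: a chunk is taken to start wherever `time` resets downward, which matches how the example data is constructed.

```r
# Example data, as built above
numbers <- c(seq(1, -1, length.out = 601), seq(1, 0.98, length.out = 601))
time    <- c(seq(90, 54144, length.out = 601), seq(90, 49850, length.out = 601))
data    <- data.frame(time = rep(time, times = 12), mag = rep(numbers, times = 6))

# Assumption: a new chunk begins at every downward reset of time
chunk_id <- cumsum(c(TRUE, diff(data$time) < 0))
chunks   <- split(data, chunk_id)

# Per chunk: keep only rows inside the 90..54000 window, then flag
# chunks containing at least one negative mag ("switched state")
switched <- vapply(chunks, function(ch) {
  ch <- ch[ch$time >= 90 & ch$time <= 54000, ]
  any(ch$mag < 0)
}, logical(1))

ss    <- sum(switched)       # switched chunks
total <- length(chunks)      # countable chunks
ss / total                   # switching probability
```

With the example data this yields 24 chunks, 12 of which contain a negative mag value, giving a probability of 0.5.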

Here is what the real dataset looks like:

V1 V2 V3 V4     V5      V6     V7      V8      V9    V10     V11     V12    V13    V14     V15    V16
1  90  0  0  0 0.0023 -0.0064 0.9987  0.0810  0.0375 0.9814  0.0829  0.0379 0.9803 0.0715  0.0270 0.9823
2 180  0  0  0 0.0023 -0.0064 0.9987  0.0887 -0.0281 0.9818  0.0956 -0.0288 0.9778 0.0796 -0.0469 0.9772
3 270  0  0  0 0.0023 -0.0064 0.9987 -0.0132 -0.0265 0.9776  0.0087 -0.0369 0.9797 0.0311 -0.0004 0.9827
4 360  0  0  0 0.0023 -0.0064 0.9987  0.0843  0.0369 0.9752  0.0765  0.0362 0.9749 0.0632  0.0486 0.9735
5 450  0  0  0 0.0023 -0.0064 0.9987  0.1075 -0.0660 0.9737  0.0914 -0.0748 0.9698 0.0586 -0.0361 0.9794
6 540  0  0  0 0.0023 -0.0064 0.9987  0.0006  0.0072 0.9808 -0.0162 -0.0152 0.9797 0.0369  0.0118 0.9763

Here is the expected output (just an example).

For part 1:

ss (switched state)   total countable chunks   switching probability
 5                           10                         5/10

For part 2:

time     mag
27207    -0.03
26520    -0.98
32034    -0.67
...
etc
Alexander
  • What exactly are you calling a chunk? – goodtimeslim May 02 '15 at 01:22
  • @goodtimeslim data chunks means number of chunks in between 90:54000 inside of about length of nrows(data.frame) – Alexander May 02 '15 at 02:17
  • I'm still confused as to what a chunk is. If `time` variable was between 90 and 54000 for all rows, does that mean the dataset is one chunk? Is a chunk a single line? Is a chunk a group where they all have the same `time` value? – goodtimeslim May 02 '15 at 02:24
  • If time variable was between 90 and 54000 for each case satisfied inside the total nrows of (data.frame)it is mean the dataset has let's say 5 or 6 case like this each one of dataset is a chunk. I named chunk because there is a another question also use this word – Alexander May 02 '15 at 02:39
  • In another word as the last sentence you have used. – Alexander May 02 '15 at 02:40
  • Could you show a few lines of what your dataframe looks like and a sample of what you want this process to end up as? – goodtimeslim May 02 '15 at 02:57
  • @goodtimeslim I modified the question. – Alexander May 02 '15 at 12:05

1 Answer


Okay, I think I have this figured out. I put the logic into two functions. For each function, you give a dataframe and a column name, and it returns the requested data.

library(dplyr)
thabescity <- function(data, col){
  filter_vec <- data[[col]] < 0
  new_df <- data %>%
    filter(filter_vec) %>%
    filter(90 <= time & time <= 54000) %>%
    group_by(time) %>%
    summarise()

  ss <- nrow(new_df)                  # distinct times with a negative value
  total <- length(unique(data$time))  # all distinct times
  switching_probability <- ss / total
  output <- as.data.frame(cbind(ss, total, switching_probability))
  return(output)
}

print(thabescity(data, "mag"))
   ss total switching_probability
1 298  1201             0.2481266

You can make a list and run it in a loop to do all the columns and have it come out in a list:

data_names <- names(data)[2:length(names(data))]
first_problem <- list()
for(name in data_names){
  first_problem[[name]] <- thabescity(data, name)
}
first_problem[["mag"]]

   ss total switching_probability
1 298  1201             0.2481266

The second problem is a bit easier:

thabescity2 <- function(data, col){
  data <- data[,c("time", col)]
  filter_vec <- data[[col]] < 0
  new_df <- data %>%
    filter(filter_vec) %>%
    filter(90 <= time & time <= 54000) %>%
    group_by(time) %>%
    filter(row_number() == 1)

  return(new_df)
}
print(thabescity2(data, "mag"))

Source: local data frame [298 x 2]
Groups: time

       time          mag
1  27207.09 -0.003333333
2  27297.18 -0.006666667
3  27387.27 -0.010000000
4  27477.36 -0.013333333
5  27567.45 -0.016666667
6  27657.54 -0.020000000
7  27747.63 -0.023333333
8  27837.72 -0.026666667
9  27927.81 -0.030000000
10 28017.90 -0.033333333
..      ...          ...

You can do the same thing as above to go through the whole dataframe:

data_names <- names(data)[2:length(names(data))]
second_problem <- list()
for(name in data_names){
  second_problem[[name]] <- thabescity2(data, name)
}
second_problem[["mag"]]

Source: local data frame [298 x 2]
Groups: time

       time          mag
1  27207.09 -0.003333333
2  27297.18 -0.006666667
3  27387.27 -0.010000000
4  27477.36 -0.013333333
5  27567.45 -0.016666667
6  27657.54 -0.020000000
7  27747.63 -0.023333333
8  27837.72 -0.026666667
9  27927.81 -0.030000000
10 28017.90 -0.033333333
..      ...          ...

Double check my results, but I think this does what you want.

goodtimeslim
  • thank you very much for your answer; on the other hand in the case of first part the output data should be look like `ss total switching_probability 1 12 24 0.5` because inside of the data as you can see there are 12 ss state (in which the mag value shows < 0) because the chunks (the data sets satisfies 90:54000) are in total 24. In your answer you only counted one chunk and in one chunk of course there are 1201 state and 298 of them is neg number. – Alexander May 03 '15 at 06:30
  • What I asked was inside of total number of rows how many of them first: satisfies the 90:54000 and count them. secondly: find mag values which only shows neg number. In each chunk only one observation is ok. so in my example data there are only 6 times observation of neg mag numbers not the total neg numbers inside of the one chunk. – Alexander May 03 '15 at 06:44
  • What I mean with chunk can be understand here http://stackoverflow.com/questions/3302356/how-to-split-a-data-frame – Alexander May 03 '15 at 06:49
  • I don't really understand what you're asking. What does "inside of total rows" mean? What is a chunk? I thought you said it was a group of rows that share the same time variable. Is a chunk a column? I see that in your example there are six columns that contain a negative number. Also each row contains a negative number. Is a row a chunk? There is no time variable in your example, does time matter aside from making sure that it's within your specified range? – goodtimeslim May 03 '15 at 07:00
  • ok. the chunk is split the data into a smaller pieces depending on the some condition. In my case, the condition is defined as the time in between 90 and 54000. In my real data that column is not important. the column which I care is V10 which is contain sometimes neg/pos numbers or all pos numbers. I try to find out extracting the total chunks (when the time between 90:54000 is satisfied from first row of data to the end of total row) lets say total rows are like this is 12 so the one chunk corresponding 1/12 of total chunks – Alexander May 03 '15 at 08:21
  • There is a time variable in my example `time <- c(seq(90,54144,length.out = 601),seq(90,49850,length.out = 601)) ` `data = data.frame(rep(time,times=12), mag=rep(numbers, times=6))` so time variable between 90,54144 repeated 12 times (12 chunk) but Mag shows switching state (sw) !in which the neg and pos numbers exist! are 6 times so the switching_probability is 0.5 – Alexander May 03 '15 at 08:35
  • I try to extract first observation of Mag value inside of one chunk. Just one time value and one Mag value time mag 1 27207.09 -0.003333333 2 27207.09 -0.003333333 3 27207.09 -0.003333333 4 27207.09 -0.003333333 5 27207.09 -0.003333333 6 27207.09 -0.003333333 here all numbers are identical due to limit of the example but in real data it is different time and different mag value for each case. – – Alexander May 07 '15 at 08:44
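Based on the clarification in these comments (one chunk per repetition of the time ramp, and at most one row per chunk), a self-contained base-R sketch is possible. It assumes, as above, that a chunk starts wherever `time` resets downward; that definition is inferred from the example data, not stated by the asker.

```r
# Example data from the question
numbers <- c(seq(1, -1, length.out = 601), seq(1, 0.98, length.out = 601))
time    <- c(seq(90, 54144, length.out = 601), seq(90, 49850, length.out = 601))
data    <- data.frame(time = rep(time, times = 12), mag = rep(numbers, times = 6))

# Assumption: a chunk starts wherever time resets downward
chunks <- split(data, cumsum(c(TRUE, diff(data$time) < 0)))

# Per chunk: restrict to the 90..54000 window, then keep only the
# first row whose mag value is negative; all-positive chunks drop out
ext_fsw <- do.call(rbind, lapply(chunks, function(ch) {
  ch <- ch[ch$time >= 90 & ch$time <= 54000, ]
  i <- which(ch$mag < 0)[1]
  if (is.na(i)) NULL else ch[i, ]
}))
ext_fsw   # one time/mag pair per switched chunk
```

On the example data this returns 12 rows, all with time 27207.09 and mag -0.00333, identical because the example repeats the same ramp; on real data each chunk would contribute its own first-negative observation.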