
I have a large dataframe that is structured as follows:

vals
  idx v
1   1 3
2   2 2
3   3 0
4   4 2
5   5 0
6   6 0
7   7 0
.
.
.

I need to write the content of this data frame to a csv file in the following way: I need to step through the 'idx' column in steps of, say, 2, and for every second idx value take the 'v' value in the corresponding row together with the next 2 'v' values below it.

Hence taking the first 7 rows of the above example dataframe:

> d=data.frame()
> temp=seq(vals[1,1],vals[nrow(vals),1]-1,2)
> for(i in temp){d=rbind(d,c(vals[which(vals[,1]==i)[1],1],vals[which(vals[,1]>=i & vals[,1]<=i+2),2]))}
> d
  X1 X3 X2 X0
1  1  3  2  0
2  3  0  2  0
3  5  0  0  0

The above code gives me what I want. However, in reality the 'vals' data frame that I am working with is really big, and this is taking an infeasible amount of time to process. I am trying to get a working parallelized version of the above code:

> d=data.frame()
> temp=seq(vals[1,1],vals[nrow(vals),1]-1,2)
> put_it=function(i){d=rbind(d,c(vals[which(vals[,1]==i)[1],1],vals[which(vals[,1]>=i & vals[,1]<=i+2),2]))}
> mclapply(temp, put_it, mc.cores = detectCores())
[[1]]
  X1 X3 X2 X0
1  1  3  2  0

[[2]]
  X3 X0 X2 X0.1
1  3  0  2    0

[[3]]
  X5 X0 X0.1 X0.2
1  5  0    0    0

Hence the 'd' data frame is reset in each worker, which does not give me what I want, as I need all of the data to end up in the same data frame.

I also considered writing the data, as a new row, to a file each time an iteration was complete:

temp=seq(vals[1,1],vals[nrow(vals),1]-1,2)
put_it=function(i){cat(vals[which(vals[,1]==i)[1],1],
         ',',paste(vals[which(vals[,1]>=i & vals[,1]<=i+10000),2],
          sep=' '),'\n',sep=' ',append=T,
           file='~/files/test.csv')}
mclapply(temp,put_it,mc.cores = detectCores())

Note that this time I am appending vectors of 10000 values rather than just the next 2. However, this runs into problems when two jobs write to the file at the same time, and I get a file where new rows start in the middle of other rows:

 [middle of a row]........0 0 0 0 01  0,  00  00  00  00  0 0 0 0 0 0 0 .....
xenopus

1 Answer


You do not need a loop for this task; you can use a vectorized approach. You only need to create a sequence that specifies the rows from which values should be extracted. Below is a short example that you can adapt. I hope I understood your question correctly and that this is the kind of output you need. Let me know if this works for you.

Update: I have added an example with foreach, while emphasizing that parallelisation is not necessary for the given task. The foreach example is only meant to show one possible way of performing the parallelisation. (Please note that the example below is not failsafe concerning chunking, etc.; for more complex referencing during parallelisation you need to think about how to split the data and generate the references.)

set.seed(0)
data <- data.frame(idx = 1:10, val = sample(101:110, 10))
#    idx val
# 1    1 109
# 2    2 103
# 3    3 110
# 4    4 105
# 5    5 106
# 6    6 102
# 7    7 104
# 8    8 108
# 9    9 107
# 10  10 101
#specify which rows shall be used for extraction
extract <- seq(from = 2, to = nrow(data), by = 2)
#[1] 2  4  6  8 10
#to get, e.g., entries of each following row simply add +1 to the extraction sequence
#and so on +2/+3, etc. for additional entries
data_extracted <- cbind(X1 = data[extract, "val" ], X2 = data[extract+1, "val"])
data_extracted
#       X1  X2
# [1,] 103 110
# [2,] 105 106
# [3,] 102 104
# [4,] 108 107
# [5,] 101  NA  
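
#the same indexing idea extends to the layout asked for in the question
#(the idx value followed by several consecutive 'v' values), and the result
#can then be written out in one go; this is a sketch only, the file name is a placeholder
data_csv <- cbind(idx = data[extract, "idx"],
                  V1  = data[extract,     "val"],
                  V2  = data[extract + 1, "val"],
                  V3  = data[extract + 2, "val"])
write.csv(data_csv, file = "extracted.csv", row.names = FALSE)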

#parallel version with foreach
#certainly not the most elegant approach and not failsafe concerning chunking/splitting
library(foreach)
library(parallel)
library(doParallel)

n_cores <- 2

data_rows <- 1:nrow(data)
chunk_size <- nrow(data)/n_cores
#chunking solution from here: https://stackoverflow.com/questions/3318333/split-a-vector-into-chunks-in-r
chunk_rows <- split(data_rows,
                     ceiling(seq_along(data_rows)/(chunk_size))
                     )

chunk_ext <- split(extract, c(rep(1:length(chunk_rows), each = floor(chunk_size/2)), length(chunk_rows)))

cluster <- parallel::makeCluster(n_cores)
doParallel::registerDoParallel(cluster)

data_extracted_parallel <- foreach(j = 1:length(chunk_rows),
                                   .combine = rbind) %dopar% {
  #work on the rows of the current chunk only
  chunk_dat <- data[chunk_rows[[j]], ]
  #positions within the chunk that correspond to the extraction indices
  ext_j <- which(chunk_dat$idx %in% chunk_ext[[j]])
  cbind(X1 = chunk_dat[ext_j, "val"], X2 = chunk_dat[ext_j + 1, "val"])
}

stopCluster(cluster)

all.equal(data_extracted_parallel, data_extracted)
#[1] TRUE
Manuel Bickel
  • thank you for the answer - your code is certainly an improvement - but, referring to the actual question title, is it possible to grow the dataframe in a way that is parallelised? – xenopus Mar 18 '18 at 22:40
  • Of course, parallelisation is possible. In your case I would recommend a look into the `foreach` package, since the syntax `foreach (i in ...) %dopar% {...}` is very close to the "normal" loops you have used. You will also have to split your data into chunks to iterate over. But again, in R you should avoid growing objects in a loop (this is different, e.g., in C or Python); a workaround to speed up such loops in R is to initialize the objects before looping, which is possible in your case since you know their length in advance (see the small preallocation sketch after these comments). Does that help? – Manuel Bickel Mar 19 '18 at 06:48
  • sorry, for the syntax error above, of course, correct use is `foreach (i = ...` – Manuel Bickel Mar 19 '18 at 09:07
  • thank you, if you edit your answer slightly by putting in something at the end about the foreach package I'll mark your answer as the right one :) – xenopus Mar 19 '18 at 12:06
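
A minimal sketch of the preallocation idea mentioned in the comments, reusing the toy `data` and `extract` objects from the answer above (object names are illustrative only): instead of growing `d` with `rbind` inside the loop, the result is allocated once with its final dimensions and filled row by row.

#preallocate the result matrix with its known final dimensions
res <- matrix(NA_integer_, nrow = length(extract), ncol = 2,
              dimnames = list(NULL, c("X1", "X2")))
for (k in seq_along(extract)) {
  #fill row k directly instead of rbind-ing a new row onto a growing object
  res[k, "X1"] <- data[extract[k], "val"]
  res[k, "X2"] <- data[extract[k] + 1, "val"]
}
#res should match data_extracted from the vectorized approach above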