I have a large dataframe that is structured as follows:
vals
  idx v
1   1 3
2   2 2
3   3 0
4   4 2
5   5 0
6   6 0
7   7 0
.
.
.
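(For reproducibility, a small data frame like the first 7 rows above can be built with something along these lines; the values past row 7 are omitted here, so this is only illustrative:)

vals <- data.frame(idx = 1:7, v = c(3, 2, 0, 2, 0, 0, 0))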
I need to write the contents of this data frame to a CSV file in the following way: I want to iterate through the 'idx' column in steps of, say, 2, and for every second idx value, take the 'v' value in the corresponding row together with the next 2 'v' values below it.
Hence taking the first 7 rows of the above example dataframe:
> d=data.frame()
> temp=seq(vals[1,1],vals[nrow(vals),1]-1,2)
> for(i in temp){d=rbind(d,c(vals[which(vals[,1]==i)[1],1],vals[which(vals[,1]>=i & vals[,1]<=i+2),2]))}
> d
X1 X3 X2 X0
1 1 3 2 0
2 3 0 2 0
3 5 0 0 0
The above code gives me what I want. However, in reality the 'vals' dataframe I am working with is very large, and this takes an infeasible amount of time to process. I am trying to get a working parallelized version of the above code:
> library(parallel)
> d=data.frame()
> temp=seq(vals[1,1],vals[nrow(vals),1]-1,2)
> put_it=function(i){d=rbind(d,c(vals[which(vals[,1]==i)[1],1],vals[which(vals[,1]>=i & vals[,1]<=i+2),2]))}
> mclapply(temp,put_it,mc.cores = detectCores())
[[1]]
X1 X3 X2 X0
1 1 3 2 0
[[2]]
X3 X0 X2 X0.1
1 3 0 2 0
[[3]]
X5 X0 X0.1 X0.2
1 5 0 0 0
Hence the 'd' data frame is reset on each call, which does not give me what I want, as I need all of the data to end up in the same data frame.
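Since mclapply forks worker processes, I gather that any assignment to 'd' inside put_it only changes a copy inside the worker and never reaches my session. I suppose the function could instead return each chunk and I could bind the returned list afterwards; a rough sketch of what I mean (get_row is just a placeholder name):

library(parallel)

temp <- seq(vals[1, 1], vals[nrow(vals), 1] - 1, 2)

# build one output row per i and return it, instead of trying to grow 'd' inside the worker
get_row <- function(i) {
  c(vals[which(vals[, 1] == i)[1], 1],
    vals[which(vals[, 1] >= i & vals[, 1] <= i + 2), 2])
}

res <- mclapply(temp, get_row, mc.cores = detectCores())
d <- as.data.frame(do.call(rbind, res))  # bind the returned rows in the parent process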
I also considered writing the data to a file as a new row each time an iteration completed:
temp=seq(vals[1,1],vals[nrow(vals),1]-1,2)
put_it=function(i){
  # append one line per i: the idx value, a comma, then the block of 'v' values
  cat(vals[which(vals[,1]==i)[1],1], ',',
      paste(vals[which(vals[,1]>=i & vals[,1]<=i+10000),2], collapse=' '),
      '\n', sep=' ', append=TRUE,
      file='~/files/test.csv')
}
mclapply(temp,put_it,mc.cores = detectCores())
Note that this time I am writing vectors of 10000 values rather than just the next 2. However, this runs into problems when two jobs execute at the same time: I get a file where new rows are started in the middle of other rows:
[middle of a row]........0 0 0 0 01 0, 00 00 00 00 0 0 0 0 0 0 0 .....
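The only workaround I can think of for the interleaving is to have each call write its own small file (named by i) and concatenate them at the end, so that no two workers ever append to the same file, but I am not sure this is the right way to go about it. A rough sketch (out_dir and the part_ naming are just placeholders):

library(parallel)

temp <- seq(vals[1, 1], vals[nrow(vals), 1] - 1, 2)
out_dir <- tempdir()  # illustrative location for the per-task pieces

# each task writes its own small file, so concurrent workers never share a file handle
put_it <- function(i) {
  row <- c(vals[which(vals[, 1] == i)[1], 1],
           vals[which(vals[, 1] >= i & vals[, 1] <= i + 10000), 2])
  writeLines(paste(row, collapse = ' '),
             file.path(out_dir, paste0('part_', i, '.csv')))
}

mclapply(temp, put_it, mc.cores = detectCores())

# stitch the pieces back together in idx order once all workers are done
parts <- file.path(out_dir, paste0('part_', temp, '.csv'))
writeLines(unlist(lapply(parts, readLines)), '~/files/test.csv')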