I have a data frame as shown below. I want to re-arrange the data frame based on 'Break Point' column.
The expected result should be as below
I have a data frame as shown below. I want to re-arrange the data frame based on 'Break Point' column.
The expected result should be as below
With this sample data:
df <- data.frame(
Range1 = c(1, 2, 3, 5, 10, 12, 16, 20, 21, 28, 33),
Range2 = c(2, 3, 5, 10, 12, 16, 20, 21, 28, 33, 40),
Breakpoint = c("", "", "", "Y", "", "Y", "", "", "Y", "", ""))
A solution with the trailing bits cut of is:
First cut off the dangling bits:
df2 = df[1:max(which(df$Breakpoint=="Y")),]
Then work out length of each group:
> rgroup=rle(rev(cumsum(rev(df2$Break=="Y"))))$lengths
Get where the Y are:
> Ypos = which(df2$Breakpoint=="Y")
Construct an index vector that is the Y positions minus a reversed sequence from 1 to the length of the chunk. Subset:
> df2[rep(Ypos, rgroup) - unlist(lapply(rgroup,function(x){1:x})) +1,]
Range1 Range2 Breakpoint
4 5 10 Y
3 3 5
2 2 3
1 1 2
6 12 16 Y
5 10 12
9 21 28 Y
8 20 21
7 16 20
Add the dangling bits back on if needed.
[edit - new version above added. Code below for historical purposes]
My old version was this, and dealt with the dangling bits:
> group=rev(cumsum(rev(df$Break=="Y")))
> rbind(do.call(rbind,lapply(split(df[group>0,],-group[group>0]),function(x){x[nrow(x):1,,drop=FALSE]}))[,c("Range1","Range2")],df[max(which(df$Break=="Y")),1:2,drop=FALSE],df[group==0,1:2])
and get:
Range1 Range2
-3.4 5 10
-3.3 3 5
-3.2 2 3
-3.1 1 2
-2.6 12 16
-2.5 10 12
-1.9 21 28
-1.8 20 21
-1.7 16 20
9 21 28
10 28 33
11 33 40
If you don't like the row names then drop them. Uses only base R functions.
I'm not sure if this works if there's no trailing matter after the last break but you've not specified the problem well if that can happen.
> group=rev(cumsum(rev(df$Break=="Y")))
This creates a vector that starts at 0 for the last row and increases each time it finds a Y. Reverse that to get a grouping variable for the chunks up to each Y.
This bit wont work if cutpaste because of the commenting I'm about to do:
> rbind(
# we need to bind three things. The reversed chunks, the last break point and
# the trailing stuff:
do.call(
# the trailing stuff is the rbind of the reversed chunks:
rbind,
# split the data into a list of chunks
lapply(
split(df[group>0,],-group[group>0]),
# reverse them
function(x){x[nrow(x):1,,drop=FALSE]}
# and only take the columns we need:
))[,c("Range1","Range2")],
# this is the last Y
df[max(which(df$Break=="Y")),1:2,drop=FALSE],
# this is the trailing rows, get them in order they appear:
df[group==0,1:2])
Annotating it out like that has let me see some optimisations that could be made but that's all for now.
Depending on the size of your data.frame this could be achieved quite manually by a for loop.
BreakPoints <- which(!is.na(DF$`break point`))
if(length(breakPoints) > 0){
startIndex <- 1 #Startindex tells me where i should point the breakPoint
for(i in breakPoints){ #Iterate over breakpoints
#Put the break point at the startIndex row
DF[startIndex:i,] <- DF[c(i, startIndex:(i-1), ]
#Update the placement as the next block
startIndex <- i + 1
}
}
if your data is large, there is likely a more efficient method. In general subsetting via [<-.dataframe
is slow compared to other methods. An initial optimizer could be simply converting the above code to data.table
format, where subsetting is much much faster.