0

I have a data frame with about 45k points with 3 columns - weight, persons and population. Population is weight*persons. I want to be able to split the data frame into ntiles(deciles, centiles etc) based on need. The data frame has to be split in a way that there are same number of population points in each ntile.
Which means, the data frame needs to be split at value = sum(population)/ntile. So for example if ntile = 10, then, sum(population)/10 = a. Next I need to add up row values in population column till sum = a, split at that point and continue this until I have run through all the 45K points. A sample of data is below.

     weight  persons population

1   3687.926    9   33191.337
2   3687.926    16  59006.8217
3   3687.926    7   25815.4847
4   4420.088    5   22100.447
5   4420.088    7   30940.6167
6   4420.088    6   26520.5287
7   3687.926    15  55318.8927
8   3687.926    9   33191.3357
9   3687.926    6   22127.5577
10  4452.829    8   35622.6367
11  4452.829    3   13358.4887
12  4452.829    4   17811.3187

I have been trying to use loops. I am stuck on splitting the data frame into the n splits needed. I an new to R. So any help is appreciated.

x= df$population
break_point = sum(x)/10
ntile_points = 0
for(i in 1:length(x))
{
  while(ntile_points != break_point)
  {
  ntile_points = ntile_points+x[i] 
  }
}

sck
  • 65
  • 7
  • 2
    Did you have a look at https://stackoverflow.com/questions/4126326/how-to-quickly-form-groups-quartiles-deciles-etc-by-ordering-columns-in-a? – talat Aug 25 '17 at 12:09
  • Yes. But I need data to be split into quantiles at specific values. In my case the splitting of the df has to be at the value obtained by Point of split = (sum(df$column)/quantile_size). To split at this value I need to add the values in the rows of the column until I get to the value of "point of split" and then split at that point. Hope my answer is not vague. – sck Aug 25 '17 at 12:45

1 Answers1

1

I'm not sure that's what you want, note that your quantile is not necessary an integer, you should substract between each break point :

ntile=10
df=cbind(df,cumsum(df$population))
names(df)[ncol(df)]='Cumsum'
s=seq(0,sum(df$population),sum(df$population)/ntile)
subdfs=list()
for (i in 2:length(s)){
    subdfs=c(subdfs,list(df[intersect(which(df$Cumsum<=s[i]),which(df$Cumsum>s[i-1])),]))
}

Then subdfs is a list which contains 10 data frames split as you wanted. Call the first data frame with subdfs[[1]] and so on. Maybe I did not understand what you want, tell me.

In this way the first df contain all the first values until the cumulate sum of the population stays in the interaval ]0,sum(population)/10], the second contains, the following values where the cumulate sum of the population is in the interval ]sum(population)/10,2*sum(population)/10], etc....

Is that what you wanted ?

Smich7
  • 460
  • 2
  • 14
  • Thank you for the reply. I was unable to run the code. The error I got was "Error in df$Cumsum <= s[i] : comparison (4) is possible only for atomic and list types". Is there maybe some specific library I need to use for this to run? – sck Aug 25 '17 at 15:23
  • What I was looking for is, if we took the just the data above, sum(population)/10 = 37500.54653. if we took the floor/ceiling value, then the 10 quantiles should split such that, sum(population) for each quantile should be approximately equal to 37500. I hope this makes a little more sense,and I apologise for any lack of clarity in the question – sck Aug 25 '17 at 15:39