I have a large matrix with 12 columns and approximately 1.000.000 rows. Each column represents the money spent by a client in a given month, so with the 12 columns I have information for 1 full year. Each row represents one client.
I need to divide the people into groups based on how much money they spent each month, and I consider the following intervals:
- money=0
- 0<money<=25
- 25<money<=50
- 50<money<=75
So for example group1 would be formed by clients that spent 0$ each month for the whole year, group2 would be clients who spent between 0 and 25$ the first month, and 0$ the rest of the months, and so on. In the end I have 12 months, and 4 intervals, so I need to divide data into 4^12=16.777.216 groups (I know this yields to more groups than observations, and that many of the groups will be empty or with very few clients, but that is another problem, so far I am interested in doing this division into groups)
I am currently working in R although I could also switch to Python if required (those are the programming languages I control best), and so far my only idea has been to use nested for
loops, one for loop for each month. But this is very, very slow.
So my question is: is there a faster way to do this?
Here I provide a small example with fake data, 10 observations (instead of the 1.000.000), 5 columns (instead of 12) and a simplified version of my current code for doing the grouping.
set.seed(5)
data = data.frame(id=1:10, matrix(rnorm(50), nrow=10, ncol=5))
intervals = c(-4, -1, 0, 1, 4)
id_list = c()
group_list = c()
group_idx = 0
for(idx1 in 1:(length(intervals)-1))
{
data1 = data[(data[, 2] >= intervals[idx1]) & (data[, 2] < intervals[idx1+1]),]
for(idx2 in 1:(length(intervals))-1)
{
data2 = data1[(data1[, 3] >= intervals[idx2]) & (data1[, 3] < intervals[idx2+1]),]
for(idx3 in 1:(length(intervals)-1))
{
data3 = data2[(data2[, 4] >= intervals[idx3]) & (data2[, 4] < intervals[idx3+1]),]
for(idx4 in 1:(length(intervals)-1))
{
data4 = data3[(data3[, 5] >= intervals[idx4]) & (data3[, 5] < intervals[idx4+1]),]
for(idx5 in 1:(length(intervals)-1))
{
data5 = data4[(data4[, 6] >= intervals[idx5]) & (data4[, 6] < intervals[idx5+1]),]
group_idx = group_idx + 1
id_list = c(id_list, data5$id)
group_list = c(group_list, rep(group_idx, nrow(data5)))
}
}
}
}
}