I have a set of 2000 rows with many elements per row. One of the elements in each row is a string ("name") that is shared by a group of 5 rows, so there are 400 unique names in total. I want all rows with the same "name" to end up in the same bucket, so the function must always return the same value for a given input.
I want to use it for k-fold cross-validation, so I need to create K buckets whose sizes are as uniform as possible: a difference of a few elements is fine, but more than 10% is not.
For K = 10 I should get 10 buckets with about 200 elements each; 190 or 210 is OK, but 250 or 180 is not. I tried this answer, but it did not give me a very uniform result. That may be due to the dataset itself, but it would be great to have a reasonably balanced number of elements per bucket. K is usually either 5 or 10.
An example:
name1, date1_1, location1_1, number1_1
name1, date1_2, location1_2, number1_2 ...
name1, date1_5, location1_5, number1_5
name2, date2_1, location2_1, number2_1 ...
name2, date2_5, location2_5, number2_5 ...
name400, date400_1, location400_1, number400_1 ...
name400, date400_5, location400_5, number400_5
Output example:
i,name1, date1_1, location1_1, number1_1
i,name1, date1_2, location1_2, number1_2 ...
i,name1, date1_5, location1_5, number1_5
j,name2, date2_1, location2_1, number2_1 ...
j,name2, date2_5, location2_5, number2_5 ...
k,name400, date400_1, location400_1, number400_1 ...
k,name400, date400_5, location400_5, number400_5
where 1 ≤ i, j, k ≤ K (K = 5 or K = 10)
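To make the requirement concrete, here is a minimal Python sketch of one possible approach, not necessarily the best one. It assumes the full list of names is known up front: the unique names are ordered by a stable digest (hashlib, because the built-in hash() for strings changes between runs) and then dealt round-robin into K buckets. The function name assign_buckets and the sample rows are made up for illustration. Note that the bucket depends on the whole set of names, so the result is reproducible for a fixed dataset but can shift if names are added or removed.

```python
import hashlib
from collections import Counter


def assign_buckets(names, k):
    """Map each unique name to a bucket in 1..k.

    Bucket sizes differ by at most one group of names.
    """
    def stable_key(name):
        # md5 is used only as a run-independent ordering key, not for security.
        return hashlib.md5(name.encode("utf-8")).hexdigest()

    # Order by digest (ties broken by the name itself), then deal round-robin.
    unique = sorted(set(names), key=lambda n: (stable_key(n), n))
    return {name: i % k + 1 for i, name in enumerate(unique)}


# Hypothetical data: 400 names with 5 rows each, as in the example above.
rows = [(f"name{g}", f"date{g}_{r}") for g in range(1, 401) for r in range(1, 6)]

buckets = assign_buckets((row[0] for row in rows), k=10)

# Prefix every row with the bucket of its name.
labelled = [(buckets[row[0]],) + row for row in rows]

# Count rows per bucket to check the balance.
sizes = Counter(bucket for bucket, *_ in labelled)
print(sorted(sizes.items()))
```

With equal-sized groups like these, the row counts per bucket differ by at most one group (5 rows). If the groups had very different sizes, a greedy variant that always puts the next-largest group into the currently smallest bucket would probably balance better.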