Python: Given a group of strings uniformly bucket them into k buckets so that same strings go to the same bucket

Question

I have a set (of 2000) rows with many elements per row. One of the elements in a row is a string ("name") that is common per group of 5 rows (the total number of unique names is 500). I want rows with the same "name" to end up in the same bucket. So the function should always return the same value for the given input.

I want to use it for a k-fold cross validation, so I need to create k buckets with numbers of elements as uniformly distributed as possible, +/- few elements is fine but more than 10% is not.

For k = 10 I should have 10 buckets with 200 elements in each, 190 or 210 is ok, but 250 and 180 is not. I tried this answer but it did not give me a very uniform result. This may be due to a dataset itself but it would be great to have somewhat balanced number of elements per bucket. K is usually either 5 or 10.

An example:

name1, date1_1, location1_1, number1_1

name1, date1_2, location1_2, number1_2 ...

name1, date1_5, location1_5, number1_5

name2, date2_1, location2_1, number2_1 ...

name2, date2_5, location2_5, number2_5 ...

name400, date400_1, location400_1, number400_1 ...

name400, date400_5, location400_5, number400_5

Output example:

i,name1, date1_1, location1_1, number1_1

i,name1, date1_2, location1_2, number1_2 ...

i,name1, date1_5, location1_5, number1_5

j,name2, date2_1, location2_1, number2_1 ...

j,name2, date2_5, location2_5, number2_5 ...

k,name400, date400_1, location400_1, number400_1 ...

k,name400, date400_5, location400_5, number400_5

where 1 < i, j, k < K (K = 5 or K = 10)

Question is very unclear. What is intended results, what does data look like, why can you not just partition it cleanly 1-20, 21-40 ad. nasueam? Provide example also please — Erich, Aug 28 '17 at 21:36
Is it possible then to just set some value K to 0, add the first group of names to K=0, then iterate K to 1 ..... until K = 5/10 then loop back to K=0? I assume this is for some ML project as you mentioned cross validation so do you need some form of random sampling designed in? — Erich, Aug 28 '17 at 21:56
I added an example, thank you for the suggestion. Yes, partitioning would be an option but I would rather not assume that the dataset came in already sorted. — hebeha, Aug 28 '17 at 21:56

score 1 · Answer 1 · answered Aug 28 '17 at 21:58

What you are asking for isn't feasible without more constraints.

Imagine if your input consisted of the input string "A" N times, with N arbitrarily large, and input string "B" only 1 times. What would like the output to be?

In any case what you want to do is solve is bin-packing optimization problem.

Woody1193 · Accepted Answer · 2017-08-28T22:37:59.487

What you want is a hash-table, yes? In that case just create a dictionary of size K, and devise a hash-function that takes your string as input and comes back with the index. In the example you provided, an appropriate one might be:

h = int(name.split(',')[0].strip("name")) % K

To be fair, this is pretty naive and doesn't take into account the distribution of your names (you could have many with name1 and very few with name400 for example) but if they are more-or-less the same then that method should work reasonably well.

If your names aren't as convenient as that, you could create a secondary table that simply takes in your name and spits out a number. For instance, suppose you had the names: "Bob", "Sally", "Larry", ...

nameIndexMappings = {"Bob" : 0, "Sally" : 1, "Larry" : 2}
h = nameIndexMappings[name.split(',')[0]] % K

Then you can setup another dictionary like this:

rowMapping = dict()
index = 0
for i in range(0, K):
    rowMapping[i] = list()
for row in rows:
    name = row.split(',')[0]
    if (name not in nameIndexMappings):
        nameIndexMappings[index] = name
        index += 1
    h = nameIndexMappings[name] % K
    rowMapping[h].append(row)

After doing this, rowMapping should contain K lists each with about the same number of elements in them (assuming, of course, that all your names are more-or-less equally distributed).

The "name1" was just an example, I don't have numbers in the name the way you assumed. Names can be "McDonald's", "Starbucks", "Coffee shop", "any name 123@". So we can assume that names are well distributed, so my concern is not the case @bearrito mentioned where one case/name is very dominant (but I agree that is a valid point in general). — hebeha, Aug 28 '17 at 22:17
@hebeha I figured that might be the case. See my updated answer — Woody1193, Aug 28 '17 at 22:26
If your names are well-distributed can you not compute the hash as the hashcode modulo K? — bearrito, Aug 31 '17 at 18:00
You could if you preferred. I think the result would be similar — Woody1193, Aug 31 '17 at 18:01

Python: Given a group of strings uniformly bucket them into k buckets so that same strings go to the same bucket

2 Answers2