Python: split a list into n batches while trying to keep items with the same id in one batch

Question

I have a json object:

json ={'message_id': '1', 'token': 'a'}
{'message_id': '2', 'token': 'b'}
{'message_id': '3', 'token': 'c'}
{'message_id': '4', 'token': 'd'}
{'message_id': '4', 'token': 'e'}
{'message_id': '1', 'token': 'f'}
{'message_id': '1', 'token': 'g'}
{'message_id': '1', 'token': 'h'}
{'message_id': '3', 'token': 'm'}
{'message_id': '3', 'token': 'k'}

I want to batch the tokens into chunks to pass to an API call. The catch is that tokens with the same message_id should go in one batch if possible; the idea is to avoid splitting one message_id's tokens across 2 batches.

For example, I want to divide the 10 messages into 2 batches, which means 5 tokens in each array. In the example above, message_id 1 has 4 tokens, 2 has 1 token, 3 has 3 tokens, and 4 has 2 tokens, which adds up to 10. The ideal grouping is 4+1 and 2+3. The final result I am looking for is:

[['a', 'f', 'g', 'h', 'b'] ,['c','d','e','m','k']]

because 'a', 'f', 'g', 'h' have the same message_id, so they go in one batch instead of splitting message_id 1's tokens into 2 arrays.
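
The per-id counts and the batch size in this example can be checked with a minimal sketch. It assumes the records are held in a list of dicts; the names `messages`, `n_batches`, and `batch_size` are illustrative and not from the original code.

```python
from collections import Counter

# The ten records from the question, as a list of dicts (assumed layout).
messages = [
    {'message_id': '1', 'token': 'a'},
    {'message_id': '2', 'token': 'b'},
    {'message_id': '3', 'token': 'c'},
    {'message_id': '4', 'token': 'd'},
    {'message_id': '4', 'token': 'e'},
    {'message_id': '1', 'token': 'f'},
    {'message_id': '1', 'token': 'g'},
    {'message_id': '1', 'token': 'h'},
    {'message_id': '3', 'token': 'm'},
    {'message_id': '3', 'token': 'k'},
]

n_batches = 2

# Number of tokens per message_id.
counts = Counter(msg['message_id'] for msg in messages)

# Ceiling division: the smallest batch size that fits all tokens in n_batches.
batch_size = -(-len(messages) // n_batches)

print(dict(counts))   # {'1': 4, '2': 1, '3': 3, '4': 2}
print(batch_size)     # 5
```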

I think this is more mathematical than coding, because I can batch them easily with the following code if I don't have to keep tokens with the same id together in one batch:

def batch(items, n):
    # Yield consecutive slices of at most n items each.
    for i in range(0, len(items), n):
        yield items[i:i + n]

To elaborate: the goal is to split m messages into n batches (n is an input variable) and to keep tokens with the same message_id in the same batch where possible. I understand that overflow is always possible: if one message_id has more than m/n tokens, it exceeds the limit and has to go into 2 batches.

Maybe I haven't understood your question well. Do you want to maximize the number of equal `message_id`s in groups? — Andrej Kesely, May 05 '20 at 22:31
no no just try not to split 2 messages with same id into different batches, the idea is try to fit same messageid in one batch if possible — inuyasha yolo, May 05 '20 at 22:35
Can you be more specific about what the issue is? Have you tried solving this on paper, writing some pseudocode? — AMC, May 05 '20 at 22:47
@AndrejKesely, if not possible then it has to be split up. I edited my question for clarity. Thank you Andrej! — inuyasha yolo, May 05 '20 at 23:18
@AMC, I edited my questions for more clarity let me know if it makes sense — inuyasha yolo, May 05 '20 at 23:20

dmmfll · Accepted Answer · 2020-05-05T23:44:15.573

Resources:

1. operator.itemgetter
2. Using zip_longest to chunk

import itertools as it
import json
import operator as op

# Copied and pasted from the question
messages_json = """{'message_id': '1', 'token': 'a'}
{'message_id': '2', 'token': 'b'}
{'message_id': '3', 'token': 'c'}
{'message_id': '4', 'token': 'd'}
{'message_id': '4', 'token': 'e'}
{'message_id': '1', 'token': 'f'}
{'message_id': '1', 'token': 'g'}
{'message_id': '1', 'token': 'h'}
{'message_id': '3', 'token': 'm'}
{'message_id': '3', 'token': 'k'}""".replace(
    "'", '"' # replace single quote with double quote for JSON
).splitlines()  # List of JSON strings

messages = (json.loads(item) for item in messages_json)

key = op.itemgetter("message_id") # Use to sort by message_id.
LIMIT = 5

values = (item['token'] for item in sorted(messages, key=key))
for chunk in it.zip_longest(*it.repeat(iter(values), LIMIT), fillvalue=False):
    print([item for item in chunk if item])

Each printed list has at most LIMIT items and holds the message tokens sorted by message_id:

['a', 'f', 'g', 'h', 'b']
['c', 'm', 'k', 'd', 'e']
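
The chunking line works because the same iterator is passed to zip_longest LIMIT times, so each output tuple pulls LIMIT consecutive items from it. Here is a minimal sketch of that idiom in isolation. The helper name `chunked` and the sentinel object are illustrative additions, not part of the code above; the sentinel is used instead of fillvalue=False so that falsy tokens such as '' or 0 are not filtered out.

```python
import itertools as it

def chunked(iterable, size):
    """Yield lists of up to `size` consecutive items from `iterable`."""
    sentinel = object()          # unique padding value, never equal to real data
    shared = iter(iterable)      # one iterator, referenced `size` times below
    # Each tuple takes one item from every "column", but all columns are the
    # same iterator, so a tuple is simply the next `size` items in order.
    for group in it.zip_longest(*[shared] * size, fillvalue=sentinel):
        yield [item for item in group if item is not sentinel]

print(list(chunked("abcdefg", 3)))
# [['a', 'b', 'c'], ['d', 'e', 'f'], ['g']]
```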

edited May 05 '20 at 23:44

answered May 05 '20 at 22:53

dmmfll


sorry this is not the end result I want to see. – inuyasha yolo May 05 '20 at 23:23
Please write out the result you want to see. – dmmfll May 05 '20 at 23:24
[['a', 'f', 'g', 'h', 'b'] ,['c','d','e','m','k']] @dmmfll – inuyasha yolo May 05 '20 at 23:26
The order of the second list differs from yours because the messages are first sorted in this code by message_id. Does the order matter? Or just the chunk size? – dmmfll May 05 '20 at 23:38
could you please explain to me the math method behind it? I dont quite understand the zip_longest method – inuyasha yolo May 05 '20 at 23:54
Take a look at this question: https://stackoverflow.com/a/29009933/1913726 There are a lot of Python concepts packed into that line of code. Familiarize yourself with the built-in `iter` and how transposition works with zip_longest(*iterable). Imagine a diagonal slicing mechanism down a list of rows. In the mean time, I will see if I can find a post about it. I practiced with it for weeks before I felt like it was familiar. – dmmfll May 06 '20 at 00:03
If the code is the solution you were seeking, please accept and up-vote. Thank you. – dmmfll May 06 '20 at 00:09
This failed the following test case, with message_id 4 split across 2 batches: """{'message_id': '1', 'token': 'a'} {'message_id': '2', 'token': 'b'} {'message_id': '3', 'token': 'c'} {'message_id': '4', 'token': 'd'} {'message_id': '4', 'token': 'e'} {'message_id': '1', 'token': 'f'} {'message_id': '1', 'token': 'g'} {'message_id': '1', 'token': 'h'} {'message_id': '3', 'token': 'm'} {'message_id': '1', 'token': 'k'} {'message_id': '1', 'token': 'z'}""" – inuyasha yolo May 06 '20 at 04:43
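
Sorting and then cutting fixed-size chunks ignores group boundaries, which is why a group can straddle two chunks. One possible alternative, not taken from the accepted answer, is a greedy "first-fit decreasing" heuristic: collect tokens per message_id, place the largest groups first, and split a group only when it is larger than the limit on its own. The sketch below assumes a per-batch limit (for example ceil(m/n)) rather than a fixed batch count; the name `batch_by_id` is hypothetical, and the heuristic does not guarantee exactly n batches or an optimal packing.

```python
from collections import defaultdict

def batch_by_id(messages, limit):
    """Greedy sketch: keep each message_id's tokens together when they fit."""
    groups = defaultdict(list)
    for msg in messages:
        groups[msg['message_id']].append(msg['token'])

    batches = []
    # Largest groups first, so they get placed while batches are still empty.
    for tokens in sorted(groups.values(), key=len, reverse=True):
        # A group larger than `limit` cannot fit anywhere whole: cut it up.
        pieces = [tokens[i:i + limit] for i in range(0, len(tokens), limit)]
        for piece in pieces:
            for existing in batches:
                if len(existing) + len(piece) <= limit:
                    existing.extend(piece)   # first batch with enough room
                    break
            else:
                batches.append(list(piece))  # nothing fits: open a new batch
    return batches

# Test case from the comment above: message_id 1 has 6 tokens, limit is 5.
ids = ['1', '2', '3', '4', '4', '1', '1', '1', '3', '1', '1']
toks = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'm', 'k', 'z']
failing_case = [{'message_id': i, 'token': t} for i, t in zip(ids, toks)]

for b in batch_by_id(failing_case, 5):
    print(b)
# ['a', 'f', 'g', 'h', 'k']
# ['z', 'c', 'm', 'd', 'e']
# ['b']
```

Here only message_id 1 is split, because its 6 tokens exceed the limit of 5; message_ids 3 and 4 each stay within one batch.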
