
I have a pandas dataframe that I need to group and store in a new array, where every group must have a specific size: if a group exceeds the maximum size it should be split, and any leftover piece below the minimum size should be added to the previous group with the smallest size. For example, after grouping the data I will have groups G with len(G) <= b, len(G) >= a, or a <= len(G) <= b. I need to make the groups with len(G) >= a meet the condition a <= len(G) <= b.

The code is working now, so I would like to know whether there is a more convenient way to do this.

import numpy as np
import pandas as pd

rng = np.random.default_rng()  # Just for testing
df = pd.DataFrame(rng.integers(0, 10, size=(1000, 4)), columns=list('ABCD'))
# The dataframe is grouped by a specific column.
ans = [pd.DataFrame(y) for x, y in df.groupby(df.columns[3], as_index=False)]

n = 20  # The maximum size of a group

new_arrayi_index = 0
new_array = []
for count_index in range(len(ans)):
    l = ans[count_index]

    if len(l) > n:
        df_shuffled = pd.DataFrame(l).sample(frac=1)
        final = [df_shuffled[i:i + n] for i in range(0, df_shuffled.shape[0], n)]

        for inde in range(len(final)):
            if len(final[inde]) <= 5 and new_arrayi_index != 0:  # The minimum size of the group is 5
                # Append the leftover rows to the previous group.
                # (Plain `+` would align on index and produce NaNs instead of appending rows.)
                new_array[new_arrayi_index - 1] = pd.concat(
                    [new_array[new_arrayi_index - 1], final[inde]])
            else:
                new_array.append(final[inde])
                new_arrayi_index += 1
    else:
        new_array.append(l)
        new_arrayi_index += 1

count_index_ = 0
for count_index in range(len(new_array)):
    print("count", count_index, "Size", len(new_array[count_index]))
    print(new_array[count_index])
    count_index_ += count_index

print(count_index_)
I_Al-thamary
    Can you maybe provide some minimal example data and the rest of the variables(e.g. `ans`) so we can run your code using only the information in the question? There's a few things I can spot right away(e.g. `len(l)//n`) but to properly refactor this, I'd need to be able to run this first. – jaaq Jan 25 '22 at 15:47
  • I have added comments to explain these variables. – I_Al-thamary Jan 25 '22 at 16:37
  • Your title suggests that you want samples of size `a <= len(l) <= b` and for that you can either remove items or repeat items. Is this correct? – Bob Jan 25 '22 at 22:49
  • I want a samples size of size `a <= len(l) <= b` but if the group cannot fit the condition, I have to add it to the smallest group. the repeated item can be there. Suppose I have 3 groups of 20 elements and the last group is one. So, I should add the one to the 20. – I_Al-thamary Jan 26 '22 at 08:27
  • so, you want to split your data into a list of dims `N x num_cols` where N is 20 here. Ideally you want all your entries in `new_array` to be of size 20 and a single one that may be smaller (but if it would be smaller than the minimum it should be slightly over 20 instead) right? Does the order of the individual entries matter? – jaaq Jan 26 '22 at 13:23
  • It may be less than `20` but not less than `b` unless if the group itself is already less than `b`. The sort will be done before the split. – I_Al-thamary Jan 26 '22 at 13:40
  • Let me try again: A valid solution will produce a list of groups satisfying the following conditions. (1) All the records in a group have the same value for the grouping column. (2) The number of records must be between `a` and `b` (inclusive). – Bob Jan 27 '22 at 16:45
  • I'm interested in the groups that are greater than `a` where I need them to be between `a` and `b`. You can see that I only split the groups that are greater than `n` . `if len(l) > n:` – I_Al-thamary Jan 27 '22 at 16:54

3 Answers


I have been following this post since the beginning, curious about how the discussion would go, because the OP's problem is not always solvable.

Existence of a solution

Take the following example: a group has 19 elements and you want to split it into sections of size between 10 and 15.

A solution exists if and only if there is an integer g such that n/b <= g <= n/a: g sections of length a use g*a <= n elements, and g sections of length b cover g*b >= n elements. In the example above, n/b ≈ 1.27 and n/a = 1.9, and no integer lies in that interval, so no valid split exists.

In this situation it is also possible to have a balanced partition, in the sense that the largest section will be at most one record larger than the smallest section (the smallest will have n//g records).
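This feasibility condition can be checked directly; a minimal sketch (the helper name `split_is_feasible` is mine, not from the answer):

```python
import math

def split_is_feasible(n, a, b):
    """True if n records can be split into sections whose sizes all lie
    in [a, b]; this holds exactly when some integer g satisfies
    n/b <= g <= n/a."""
    g_min = math.ceil(n / b)  # fewest sections that respect the upper bound b
    g_max = n // a            # most sections that respect the lower bound a
    return g_min <= g_max

print(split_is_feasible(19, 10, 15))  # False: the example above has no valid split
print(split_is_feasible(20, 10, 15))  # True: two sections of 10
```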

Restating the problem

We could slightly modify the problem: split into the minimum possible number of sections containing at most b records each, such that the length of each section satisfies a <= len(s) <= a+1.

Notice that in this case we are adjusting a to be as close as possible to b so that the problem has a solution. For solvable instances this gives a solution to the original problem; for instances that cannot be solved, it relaxes the original requirement by reducing a until a split exists.

The example above would become: split 19 elements into the minimum possible number of balanced sections with no more than 15 elements each. The solution is then one section of 10 and one section of 9 elements.
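That 19-element example can be reproduced with `np.array_split`, which the solution below relies on; a small standalone sketch (the dataframe is made up for illustration):

```python
import numpy as np
import pandas as pd

b = 15
y = pd.DataFrame({'v': range(19)})    # a hypothetical group of 19 records
num_sections = (len(y) + b - 1) // b  # ceiling division: 2 sections needed
sections = np.array_split(y, num_sections)

# array_split balances the sections: sizes differ by at most one record
print([len(s) for s in sections])  # [10, 9]
```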

A pythonic solution

def group_and_split(df, b, column, n=1):
    '''
    - df    : a dataframe
    - b     : the largest allowed section
    - column: the column by which the data must be grouped
    - n     : don't split groups that are smaller than this
    '''

    # doing it in a pythonic way
    return [[y] if len(y) <= n else np.array_split(y, (len(y)+b-1)//b)
             for x, y in df.groupby(column, as_index=False)]

You can check that it gives a solution to the restated problem:

pd.DataFrame([{
    'num-sections': len(g), 
    'largest-section': max(len(gi) for gi in g), 
    'smallest-sections':min(len(gi) for gi in g)
} for g in group_and_split(df, 25, 'D')])

A complete running code

import pandas as pd
import numpy as np

rng = np.random.default_rng(1)  # Just for testing
df = pd.DataFrame(rng.integers(0, 10, size=(1000, 4)), columns=list('ABCD'))

def group_and_split(df, b, column, n=1):
    '''
    - df    : a dataframe
    - b     : the largest allowed section
    - column: the column by which the data must be grouped
    - n     : don't split groups that are smaller than this
    '''

    # doing it in a pythonic way
    return [[y] if len(y) <= n else np.array_split(y, (len(y)+b-1)//b)
             for x, y in df.groupby(column, as_index=False)]



pd.DataFrame([{
    'num-sections': len(g), 
    'largest-section': max(len(gi) for gi in g), 
    'smallest-sections':min(len(gi) for gi in g)
} for g in group_and_split(df, 25, 'D', 93)])

(Image: output table listing num-sections, largest-section and smallest-sections for each group.)

Printing all the groups

for group in group_and_split(df, 25, 'D'):
    for section in group:
        print(section)
Bob
  • It is not working with my data and it shows a few groups with a small size and shows `AttributeError: 'str' object has no attribute 'itertuples'` https://ibb.co/s9BQJYB – I_Al-thamary Jan 31 '22 at 22:28
  • As far as I can see it works with the data as you gave it. I added complete running code to make it easy for you to check. pandas 1.3.2, numpy 1.19.5 here. – Bob Jan 31 '22 at 23:38
  • Could you please show the data of every group? – I_Al-thamary Jan 31 '22 at 23:56
  • Please take a closer look at the output of `group_and_split`. It gives a list of lists, each list corresponding to one group. You have the code, so you can inspect whatever you want, e.g. `group_and_split(df, 25, 'D')[2][0]` gives the first section of the third group. – Bob Feb 01 '22 at 00:02
  • I tried to show them and it shows the first 5 groups and then shows `AttributeError: 'str' object has no attribute 'itertuples'` – I_Al-thamary Feb 01 '22 at 00:06
  • Could you please post a separate question about what you are seeing, with a minimal reproducible example, so that we keep this post tidy. Thank you. – Bob Feb 01 '22 at 08:52
  • Actually, I have run your code and it shows the `num-sections`, `largest-section` , and `smallest-sections` but it showed a problem when I printed the dataframe. So, I would like to print all the groups. – I_Al-thamary Feb 01 '22 at 09:04
  • Added a code snippet that prints all the sections from all the groups. – Bob Feb 01 '22 at 10:27
  • It does not exceed `25` but I only need to split the group that is greater than `n`. I'm sorry but I can not see that in the current code. I need to keep all the groups that are less than `n` and split the group that is greater than `n` to be between `a` and `b`. – I_Al-thamary Feb 01 '22 at 10:47

I wrote a function that splits the dataframe into chunks equal to the maximum size. It checks the size of the remainder for the last chunk and, if the remainder is smaller than the minimum size, splits the last two chunks into two chunks of approximately equal size.

Building off the answer at Split a large pandas dataframe

import numpy as np
import pandas as pd


rng = np.random.default_rng(seed=1)  # Just for testing
df = pd.DataFrame(rng.integers(0, 10, size=(1000, 4)), columns=list('ABCD'))
# The dataframe will be grouped by a specific column below.

n = 20  # The maximum size of a group


# https://stackoverflow.com/questions/17315737/split-a-large-pandas-dataframe

def split_dataframe(df, chunk_size=20, min_size=10):

    chunks = list()
    remainder = len(df) % chunk_size

    if 0 < remainder < min_size:
        num_chunks = len(df) // chunk_size - 1
        for i in range(num_chunks):
            chunks.append(df[i * chunk_size:(i + 1) * chunk_size])
        df_ = df[(num_chunks) * chunk_size:]
        last_break = int(len(df_) / 2)
        chunks.append(df_[:last_break])
        chunks.append(df_[last_break:])
        return chunks
    else:
        num_chunks = len(df) // chunk_size + 1
        if remainder == 0:
            num_chunks += -1
        for i in range(num_chunks):
            chunks.append(df[i*chunk_size:(i+1)*chunk_size])
        return chunks


new_array = []
for group, df_ in df.groupby(df.columns[3], as_index=False):
    if len(df_) > n:
        new_array.extend(split_dataframe(df_))
    else:
        new_array.append(df_)  # append the whole frame; extend() would iterate over column labels

count_index_ = 0
for count_index in range(len(new_array)):
    print("count", count_index, "Size", len(new_array[count_index]))
    print(new_array[count_index])
    count_index_ += count_index

print(count_index_)
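To sanity-check the chunk sizes this produces, here is a condensed, standalone restatement of the function above with a small demo (the 47-row frame is made up for illustration):

```python
import pandas as pd

def split_dataframe(df, chunk_size=20, min_size=10):
    # Condensed copy of the answer's function, same behavior.
    chunks = []
    remainder = len(df) % chunk_size
    if 0 < remainder < min_size:
        # Last chunk would be too small: hold back one full chunk and
        # split the combined tail into two roughly equal halves.
        num_chunks = len(df) // chunk_size - 1
        for i in range(num_chunks):
            chunks.append(df[i * chunk_size:(i + 1) * chunk_size])
        df_ = df[num_chunks * chunk_size:]
        last_break = len(df_) // 2
        chunks.append(df_[:last_break])
        chunks.append(df_[last_break:])
    else:
        num_chunks = len(df) // chunk_size + (0 if remainder == 0 else 1)
        for i in range(num_chunks):
            chunks.append(df[i * chunk_size:(i + 1) * chunk_size])
    return chunks

demo = pd.DataFrame({'v': range(47)})
print([len(c) for c in split_dataframe(demo)])  # [20, 13, 14]
```

With 47 rows the remainder is 7 (< 10), so one full chunk of 20 is kept and the remaining 27 rows are split into 13 and 14.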
Troy D
  • Thanks, I only need to split the groups with a size of more than 20. Also, after running the code, I got a group with a size 0. – I_Al-thamary Jan 28 '22 at 18:57
  • I assume the problem was that some dataframes were exactly divisible by n, so I added a couple lines to check if remainder == 0: num_chunks += -1. Let me know if that fixes it. I added a check to skip the function if the dataframe has fewer rows than n, but it's not actually necessary. – Troy D Jan 31 '22 at 01:36
  • I have added to my code and it shows an error of some lists that `AttributeError: 'str' object has no attribute 'itertuples'` and shows missing items. – I_Al-thamary Jan 31 '22 at 11:06

Change this line:

`ans = [pd.DataFrame(y) for x, y in df.groupby(df.columns[3], as_index=False)]`

to

`ans = [pd.DataFrame(y) for x, y in df.groupby(df.columns[3].min(), as_index=False)]`

for the min, and to

`ans = [pd.DataFrame(y) for x, y in df.groupby(df.columns[3].max(), as_index=False)]`

for the max.