1
            city  temperature  windspeed   event
            day                                                 
            2017-01-01  new york           32          6    Rain
            2017-01-02  new york           36          7   Sunny
            2017-01-03  new york           28         12    Snow
            2017-01-04  new york           33          7   Sunny
            2017-01-05  new york           31          7    Rain
            2017-01-06  new york           33          5   Sunny
            2017-01-07  new york           27         12    Rain
            2017-01-08  new york           23          7  Rain
            2017-01-01    mumbai           90          5   Sunny
            2017-01-02    mumbai           85         12     Fog
            2017-01-03    mumbai           87         15     Fog
            2017-01-04    mumbai           92          5    Rain
            2017-01-05    mumbai           89          7   Sunny
            2017-01-06    mumbai           80         10     Fog
            2017-01-07    mumbai           85         9     Sunny
            2017-01-08    mumbai           89          8    Rain
            2017-01-01     paris           45         20   Sunny
            2017-01-02     paris           50         13  Cloudy
            2017-01-03     paris           54          8  Cloudy
            2017-01-04     paris           42         10  Cloudy
            2017-01-05     paris           43         20   Sunny
            2017-01-06     paris           48         4  Cloudy
            2017-01-07     paris           40          14  Rain
            2017-01-08     paris           42         15  Cloudy
            2017-01-09     paris           53         8  Sunny

The above shows the original data.

Below shows the result using np.array_split(data, 4).

            day city  temperature  windspeed  event                                                
            2017-01-01  new york           32          6    Rain
            2017-01-02  new york           36          7   Sunny
            2017-01-03  new york           28         12    Snow
            2017-01-04  new york           33          7   Sunny
            2017-01-05  new york           31          7    Rain
            2017-01-06  new york           33          5   Sunny
            2017-01-07  new york           27         12    Rain  

            day city  temperature  windspeed  event                                                    
            2017-01-08  new york           23          7  Rain
            2017-01-01    mumbai           90          5   Sunny
            2017-01-02    mumbai           85         12     Fog
            2017-01-03    mumbai           87         15     Fog
            2017-01-04    mumbai           92          5    Rain
            2017-01-05    mumbai           89          7   Sunny             
            day city  temperature  windspeed  event                                                  
            2017-01-06    mumbai           80         10     Fog
            2017-01-07    mumbai           85         9     Sunny
            2017-01-08    mumbai           89          8    Rain
            2017-01-01     paris           45         20   Sunny
            2017-01-02     paris           50         13  Cloudy
            2017-01-03     paris           54          8  Cloudy              
            day city  temperature  windspeed  event             
            2017-01-04     paris           42         10  Cloudy
            2017-01-05     paris           43         20   Sunny
            2017-01-06     paris           48         4  Cloudy
            2017-01-07     paris           40          14  Rain
            2017-01-08     paris           42         15  Cloudy
            2017-01-09     paris           53         8  Sunny

As you can see here, I'm trying to create 4 groups from the original data, making sure that each group has all the cities. however, by using array.split(), it split the data into 4 groups but it does not contain all the cities. I want each group to have Mumbai, Paris and New York. How can I do that?

Meaning to say, what I'm trying to achieve is something like below:

Group 1:

            day city  temperature  windspeed  event                                                
            2017-01-01  new york           32          6   Rain
            2017-01-02  paris           50         13  Cloudy
            2017-01-02    mumbai           85         12    Fog, 
            2017-01-05  new york           31          7    Rain
            2017-01-06  new york           33          5   Sunny
            2017-01-05    mumbai           89          7   Sunny  
            2017-01-05     paris           43         20   Sunny

Group 2:

            day city  temperature  windspeed  event                                                    
            2017-01-04  new york           33          7  Sunny
            2017-01-01    mumbai           90          5  Sunny
            2017-01-03  paris           54          8  Cloudy
            2017-01-07  new york           27         12    Rain 
            2017-01-06    mumbai           80         10     Fog
            2017-01-09     paris           53         8  Sunny

Group 3:

            day city  temperature  windspeed  event         
            2017-01-02  new york           36          7  Sunny                                         
            2017-01-03  mumbai           87         15    Fog
            2017-01-01   paris           45         20  Sunny,   
            2017-01-08    mumbai           89          8    Rain
            2017-01-06     paris           48         4  Cloudy
            2017-01-07     paris           40          14  Rain

Group 4:

            day city  temperature  windspeed  event             
            2017-01-03  new york           28         12   Snow,  
            2017-01-04  mumbai           92          5   Rain
            2017-01-07    mumbai           85         9     Sunny
            2017-01-04  paris           42         10  Cloudy
            2017-01-08     paris           42         15  Cloudy
            2017-01-08  new york           23          7  Rain

As you can see from the expected result, the main thing is that all the groups contain each topic.

What I have in mind is to group the data by city, then from each city's dataframe, divide the data into 4 groups, then for each group in the city, combine the data to get 4 final group.

Lily
  • 55
  • 1
  • 8
  • Thanks. And just wondering given your comment on jpp's answer. Is the only condition for the grouping that the four groups contain the 3 cities? So it doesnt matter if `2017-01-03 | new york` is in group 1 or group 4 ? – gyx-hh Aug 07 '18 at 11:26
  • @gyx-hh yeap. But it would be better if the number of each cities in each group is the same. meaning that for example in group 1, i prefer it to have 3 new york, 2 mumbai, 2 paris. rather than 5 new york, 1 mumbai and 1 paris. because my real data consist of more than 2000 data – Lily Aug 07 '18 at 11:32

2 Answers2

2

You can create a helper column via GroupBy + cumcount to count the occurrence of each city.

Then use dict + tuple with another GroupBy to create a dictionary of dataframes, each one containing exactly one occurence of each city.

# add index column giving count of city occurrence
df['index'] = df.groupby('city').cumcount()

# create dictionary of dataframes
d = dict(tuple(df.groupby('index')))

Result:

print(d)

{0:                city  temperature  windspeed  event  index
 day                                                      
 2017-01-01  newyork           32          6   Rain      0
 2017-01-01   mumbai           90          5  Sunny      0
 2017-01-01    paris           45         20  Sunny      0,
 1:                city  temperature  windspeed   event  index
 day                                                       
 2017-01-02  newyork           36          7   Sunny      1
 2017-01-02   mumbai           85         12     Fog      1
 2017-01-02    paris           50         13  Cloudy      1,
 2:                city  temperature  windspeed   event  index
 day                                                       
 2017-01-03  newyork           28         12    Snow      2
 2017-01-03   mumbai           87         15     Fog      2
 2017-01-03    paris           54          8  Cloudy      2,
 3:                city  temperature  windspeed   event  index
 day                                                       
 2017-01-04  newyork           33          7   Sunny      3
 2017-01-04   mumbai           92          5    Rain      3
 2017-01-04    paris           42         10  Cloudy      3}

You can then extract individual "groups" via d[0], d[1], d[2], d[3]. In this particular case, you may wish to group by dates instead, i.e.

d = {df_.index[0]: df_ for _, df_ in df.groupby('index')}
jpp
  • 159,742
  • 34
  • 281
  • 339
  • Thank you. The results are kinda what I am looking for. However, I tried it on my actual data, it seems that for this code, if there are more than 12 data, let say 25 data, it will split the data into 9 groups ( 25/3 - 3 as there are 3 different cities). However, what is the code if I want it to be exactly 4 groups only, so it doesn't matter if there are two same cities in the group, as long as every cities are in each group, and it doesn't matter if each group has uneven numbers – Lily Aug 07 '18 at 11:00
  • @Lily, This isn't clear. Do you need to make sure each dataframe result has the same number of rows. Or could you just tack all the extra dataframes onto the end of the 4th? Or should they be added randomly to the first 4 groups? – jpp Aug 07 '18 at 11:01
  • Let's say if I have 25 data with 3 different cities only and I want to group into 4 groups, I want to have 3 groups of 6 data, with 1 group of 7 data (since it is impossible to divide 25 data evenly into 4 groups) – Lily Aug 07 '18 at 11:04
  • 1
    @Lily, OK, this is a bit involved. I can't see a simple solution. Maybe un-accept this answer and make your question clearer with a more representative example. – jpp Aug 07 '18 at 11:05
0

This is my approach to this. First sort your dataframe by day and city:

df = df.sort_values(by=['day', 'city'])

Next find an even split of 4 groups for your dataframe - if the split is not even then the last group will get the remaining:

n = int(len(df)/4)
groups_n = np.cumsum([0, n, n, n, len(df)-(3*n)])
print(groups_n)
OUT >> array([ 0,  6, 12, 18, 25], dtype=int32)

groups_n is the start and end for each group. So Group 1 I will take df.iloc[0:6] and Group 4 I will take df.iloc[18:25].

So your final dictionary, d, of the 4 group split of your dataframe will be:

d = {}
for i in range(4):
    d[i+1] = df.iloc[groups_n[i]:groups_n[i+1]]

Example Outputs:Group 1 (d[1])

            city      temperature  windspeed    event
day             
2017-01-01  mumbai    90           5            Sunny
2017-01-01  new york  32           6            Rain
2017-01-01  paris     45           20           Sunny
2017-01-02  mumbai    85           12           Fog
2017-01-02  new york  36           7            Sunny
2017-01-02  paris     50           13           Cloudy

Group 4: (d[4])

            city       temperature  windspeed   event
day             
2017-01-07  mumbai     85           9           Sunny
2017-01-07  new york   27           12          Rain
2017-01-07  paris      40           14          Rain
2017-01-08  mumbai     89           8           Rain
2017-01-08  new york   23           7           Rain
2017-01-08  paris      42           15          Cloudy
2017-01-09  paris      53           8           Sunny
gyx-hh
  • 1,421
  • 1
  • 10
  • 15
  • @gyx-xh I've tried your code, however, the results that I got from the code were same like using np.array_split(data, 4). It grouped all the cities together, in one group theres only 1 or 2 city. – Lily Aug 08 '18 at 01:31