I'm in a similar situation to this case. I'm working on a project with a large dataframe of about half a million rows, involving about 2000 users. (I got this number by calling value_counts() on a column called NoUsager.)

I'd like to split the dataframe into several arrays/dataframes, one per user, for plotting afterwards. I got the list of users like this:

df.sort_values(by='NoUsager',inplace=True)
df.set_index(keys=['NoUsager'],drop=False,inplace=True)
users = df['NoUsager'].unique().tolist()

I know the next step is a loop that generates the smaller dataframes, but I have no idea how to write it. I combined the code above with the code from that case, but it did not solve my problem.

What should I do?


EDIT

I want both a histogram and a boxplot of the dataframe. With the answer provided, I already have a boxplot of all NoUsager values. But with this much data, the boxplot is too small to read, so I'd like to split the dataframe by NoUsager and plot the pieces separately. The diagrams I'd like to have:

  1. boxplot, column=DureeService, by=NoUsager
  2. boxplot, column=DureeService, by=Weekday
  3. histogram, for every Weekday,by=DureeService

I hope it is explained well this time.

DataType:

          Weekday  NoUsager  Periods  Sens    DureeService
DataType  string   string    string   string  datetime.time

Sample of DataFrame:

Weekday NoUsager Periods Sens DureeService
Lun 000001 Matin + 00:00:05 
Lun 000001 Matin + 00:00:04 
Mer 000001 Matin + 00:00:07 
Dim 000001 Soir  - 00:00:02 
Lun 000001 Matin + 00:00:07 
Jeu 000001 Soir  - 00:00:04 
Lun 000001 Matin + 00:00:07 
Lun 000001 Soir  - 00:00:04 
Dim 000001 Matin + 00:00:05 
Lun 000001 Matin + 00:00:03 
Mer 000001 Matin + 00:00:04 
Ven 000001 Soir  - 00:00:03 
Mar 000001 Matin + 00:00:03 
Lun 000001 Soir  - 00:00:04 
Lun 000001 Matin + 00:00:04 
Mer 000002 Soir  - 00:00:04 
Jeu 000003 Matin + 00:00:50 
Mer 000003 Soir  - 00:06:51 
Mer 000003 Soir  - 00:00:08 
Mer 000003 Soir  - 00:00:10 
Jeu 000003 Matin + 00:12:35 
Lun 000004 Matin + 00:00:05 
Dim 000004 Matin + 00:00:05 
Lun 000004 Matin + 00:00:05 
Lun 000004 Matin + 00:00:05 

What bothers me is that none of these columns is numeric, so they have to be converted every time.

Thanks in advance!

ch36r5s

2 Answers

[g for _, g in df.groupby('NoUsager')] gives you a list of data frames where each dataframe contains one unique NoUsager. But I think what you need is something like:

for k, g in df.groupby('NoUsager'):
    # k is the NoUsager value, g is the data frame for that user
    g.plot(kind=..., x=..., y=...)  # fill in the plot type and columns you need
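As a minimal sketch of the splitting step on its own, the same `groupby` iteration can fill a dictionary keyed by user. The small frame below is made up for illustration and only reuses the column names from the question; `frames` is a hypothetical name.

```python
import pandas as pd

# Made-up sample that reuses the question's column names.
df = pd.DataFrame({
    "NoUsager": ["000001", "000001", "000002", "000003", "000003"],
    "Weekday": ["Lun", "Mer", "Mer", "Jeu", "Mer"],
    "DureeService": ["00:00:05", "00:00:07", "00:00:04", "00:00:50", "00:06:51"],
})

# groupby yields (key, sub-frame) pairs; the dict keeps one frame per NoUsager.
frames = {k: g for k, g in df.groupby("NoUsager")}

for k, g in frames.items():
    # k is the NoUsager value (a string here), g holds only that user's rows.
    print(k, len(g))
```

Each value in `frames` is an ordinary DataFrame, so it can be plotted or inspected on its own, for example `frames["000001"]`.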
Psidom
  • Can you be more specific? I updated the question with something about plotting. What should I do with `k`? Although the `NoUsager` values are numeric, they are stored as `string`. Should I convert them to numbers first? – ch36r5s Sep 08 '16 at 23:58
  • `g` is simply a data frame which contains only one unique `NoUsager` and `k` is the unique `NoUsager` for `g`. You can use it however you want, as title or as label, etc... It's just a number here or you can ignore it if you don't need it. Try a small data frame and `print(g)` using the `for` loop, you will see what I mean. – Psidom Sep 09 '16 at 00:02
  • @ch36r5s, please provide a sample data set. In this case the SO community will be able to write a working (and tested) snippet for you. [How to make good reproducible pandas examples](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – MaxU - stand with Ukraine Sep 09 '16 at 06:37
  • @Psidom Thanks, I have been looking for this all day. You may want to add your explanation to your answer instead of explaining it in the comments, though. – Martijn Apr 16 '20 at 12:38

No need to sort first. You may try this with your original DataFrame:

# Import third-party libraries
import pandas as pd
import numpy as np
# Define a function that takes the DataFrame and returns a dictionary
def splitting_dataframe(df):
    d = {}                                   # Start with an empty dictionary
    nousager = np.unique(df.NoUsager.values) # Get the unique NoUsager values
    for NU in nousager:                      # Loop over the NoUsager values
        d[NU] = df[df.NoUsager == NU]        # Keep only the rows for this NoUsager
    return d                                 # Return the dictionary once the loop is done
dictionary = splitting_dataframe(df)  # Call the function

After this, you can call the DataFrame for specific NoUsager by:

dictionary[target_NoUsager]

Hope this helps.


EDIT

If you want to do a box plot, have you tried:

df.boxplot(column='DureeService', by='NoUsager')

directly? More information here: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.boxplot.html


EDIT

If you want a boxplot for several selected 'NoUsager':

targets = [...]  # replace with the selected NoUsager values
mask = df['NoUsager'].isin(targets)  # True for rows whose NoUsager is in targets
df[mask].boxplot(column='DureeService', by='NoUsager')
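The mask-based selection above can be repeated over consecutive chunks of users, so that each figure shows only a few boxes. This is a rough sketch under stated assumptions: the data is randomly generated, `DureeSeconds` is a hypothetical column that already holds the duration as a number, and `chunk_size` is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up data: 10 users with numeric durations in seconds.
rng = np.random.default_rng(0)
users = [f"{i:06d}" for i in range(1, 11)]
df = pd.DataFrame({
    "NoUsager": rng.choice(users, size=200),
    "DureeSeconds": rng.integers(1, 60, size=200).astype(float),
})

chunk_size = 4  # hypothetical: number of users per figure
all_users = sorted(df["NoUsager"].unique())
chunks = [all_users[i:i + chunk_size]
          for i in range(0, len(all_users), chunk_size)]

for chunk in chunks:
    # Keep only the rows belonging to the users in this chunk.
    subset = df[df["NoUsager"].isin(chunk)]
    subset.boxplot(column="DureeSeconds", by="NoUsager")
    plt.close("all")  # save or show the figure before closing, as needed
```

With about 2000 users this produces many figures, so saving each one to a file is probably more practical than showing them interactively.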

If you want a histogram for a selected 'NoUsager':

df[df.NoUsager == target_NoUsager].hist(column='DureeService')

If you still need to separate them, @Psidom's first line is good enough.
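Since `DureeService` holds `datetime.time` values, plotting generally needs a numeric column first. The sketch below is one possible way to do the conversion once and then draw the per-Weekday plots from the question. The sample rows are made up and `DureeSeconds` is a hypothetical column name.

```python
import datetime as dt

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows that reuse the question's column names and data types.
df = pd.DataFrame({
    "Weekday": ["Lun", "Lun", "Mer", "Dim", "Mer"],
    "NoUsager": ["000001", "000001", "000001", "000001", "000003"],
    "DureeService": [dt.time(0, 0, 5), dt.time(0, 0, 4), dt.time(0, 0, 7),
                     dt.time(0, 0, 2), dt.time(0, 6, 51)],
})

# datetime.time -> "HH:MM:SS" string -> Timedelta -> seconds as float.
df["DureeSeconds"] = (
    pd.to_timedelta(df["DureeService"].astype(str)).dt.total_seconds()
)

# One histogram panel per Weekday.
df.hist(column="DureeSeconds", by="Weekday")

# One box per Weekday on a shared axis.
df.boxplot(column="DureeSeconds", by="Weekday")

plt.close("all")  # save or show the figures before closing, as needed
```

Converting once and storing the result avoids repeating the conversion for every plot.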

idchiang
  • It worked only once: it stopped after the first value of `nousager`. – ch36r5s Sep 08 '16 at 23:50
  • as @Psidom suggests, I think "groupby" would be an easier way to go. Still, could you provide the error message? – idchiang Sep 09 '16 at 00:42
  • I just added a line for boxplot. No need to split the DataFrame if that is what you only need. – idchiang Sep 09 '16 at 00:54
  • It worked with boxplot. Is it possible to plot a histogram directly with this df? – ch36r5s Sep 09 '16 at 02:20
  • Indeed, it's not necessary to split for the boxplot. But with this amount of data, the diagram is too small to show anything statistically useful. – ch36r5s Sep 09 '16 at 02:46
  • I am a little bit confused by 'histogram': do you want to have a boxplot for several selected 'NoUsager', or do you want to plot a histogram of 'DureeService' for one 'NoUsager'? – idchiang Sep 09 '16 at 02:54
  • A histogram with x=`NoUsager` and y=`DureeService`. Note that `DureeService` has data type `datetime.time`, not seconds. – ch36r5s Sep 09 '16 at 03:36
  • Hmm...I think we are referring to different kinds of histogram. I added two sets of code above. Hope that helps. – idchiang Sep 09 '16 at 05:24