0

Thanks to this kind and helpful community, I solved the first problem I had in my work, which you can see here: Basic Problem (necessary for understanding what follows).

After using that solution, I wanted to visualize the distribution of the classes and of the NaN values in the features, so I plotted it as a bar chart. With a few classes this is pretty handy.

The problem is that I have about 120 different classes and 50,000 data objects in total, so the plots are not readable with this amount of data.

Therefore I wanted to split the visualization.

For each class there should be a subplot showing the sum of the NaN values of each feature.

Data:

CLASS FEATURE1 FEATURE2 FEATURE3
  X      1        1        2
  B      0        0        0
  C      2        3        1

Actual Plot:

Normal Plot

Expected Plots:

[three images, not preserved]

None of my approaches have worked so far.

  1. I tried df.groupby('Class').plot(kind="barh", subplots=True). This completely destroyed the layout and plotted per feature, not per class.
  2. I tried this approach, but when I store my groupby result in the variable 'grouped', I can print it in a perfect format with all the information, yet I cannot access it the way it is done in that solution. I always get the error: 'string indices must be integers'

My approach:

grouped = df.groupby('Class') 
for name, group in grouped: 
    group.plot.bar()

EDIT - Further Information

The data I use is completely categorical, with no numerical values. I want to display the number of NaN values in the different features for each class (label) of my dataset.
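A minimal sketch of the goal described above, assuming a categorical frame with a `CLASS` column. The toy data and the helper names `nan_counts_per_class` and `plot_nan_counts` are made up for illustration and are not from the original post. It counts the NaN values per class and feature, then draws one bar subplot per class with plain matplotlib.

```python
import matplotlib.pyplot as plt
import pandas as pd


def nan_counts_per_class(df, class_col="CLASS"):
    """Return one row per class and one column per feature,
    holding the number of NaN values."""
    features = df.drop(columns=class_col)
    return features.isna().groupby(df[class_col]).sum()


def plot_nan_counts(counts, ncols=3):
    """Draw one bar subplot per class (one per row of `counts`)."""
    nrows = -(-len(counts) // ncols)  # ceiling division
    fig, axes = plt.subplots(nrows, ncols, squeeze=False,
                             figsize=(4 * ncols, 3 * nrows))
    flat = axes.ravel()
    for ax, (name, row) in zip(flat, counts.iterrows()):
        row.plot.bar(ax=ax, title=str(name))
    for ax in flat[len(counts):]:
        ax.set_visible(False)  # hide unused cells of the grid
    fig.tight_layout()
    return fig, axes


# Hypothetical categorical data with missing values
df = pd.DataFrame({
    "CLASS": ["A", "A", "B", "B", "B"],
    "FEATURE1": ["x", None, "y", None, None],
    "FEATURE2": ["u", "v", None, "w", "z"],
})
counts = nan_counts_per_class(df)
fig, axes = plot_nan_counts(counts)

if __name__ == "__main__":
    plt.show()
```

With around 120 classes a single grid becomes very tall, so it may be more practical to call `plot_nan_counts` on slices of `counts` and produce several figures.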

codlix
  • Please post attempted plotting code so we can see why #2 does not work. – Parfait Dec 27 '18 at 23:40
  • `grouped = df.groupby['Class']` `for name, group in grouped: group.plot.bar()` The error message is actually "'str' object has no attribute 'plot'", which leads me to think that there is actually no DataFrame in the variable grouped. – codlix Dec 27 '18 at 23:51
  • `groupby` is a method, so it needs `()` in the call: `df.groupby(['Class'])`. In fact, the line before the loop should have raised an error. Please edit the post with the full attempted code block (not in comments) for a [MCVE]. – Parfait Dec 28 '18 at 01:29
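To illustrate the point made in the comments, here is a small sketch on a made-up frame (the data is hypothetical). Subscripting `groupby` with square brackets raises a `TypeError`, while calling it with parentheses yields `(name, DataFrame)` pairs when iterated.

```python
import pandas as pd

# Hypothetical toy data, only for demonstration
df = pd.DataFrame({"Class": ["A", "A", "B"], "FEATURE1": ["x", None, "y"]})

# Square brackets try to subscript the bound method itself
try:
    df.groupby["Class"]
except TypeError as exc:
    subscript_error = exc
    print(type(exc).__name__, exc)

# Parentheses call the method; iterating gives (group name, sub-DataFrame)
groups = list(df.groupby("Class"))
for name, group in groups:
    print(name, type(group).__name__, len(group))
```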

3 Answers

3

A solution using seaborn

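import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Reshape to long format: one row per (CLASS, feature) pair
ndf = pd.melt(df, id_vars="CLASS", var_name="feature", value_name="val")
# One bar-plot facet per class, stacked in a single column
sns.catplot(x="feature", y="val", col="CLASS", data=ndf, kind="bar", col_wrap=1)
plt.show()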

meW
  • Thank you for the response. Because I want to display the number of NaN values in each feature of a class, I work with a DataFrameGroupBy object, which, if I tackle the problem with your solution, gives me the error "Cannot access callable attribute 'copy' of 'DataFrameGroupBy' objects, try using the 'apply' method". – codlix Dec 28 '18 at 00:19
  • Can you show sample data that you're trying to plot? – meW Dec 28 '18 at 00:23
  • I cannot provide an exact data example because the data is sensitive, but I can give you a symbolic example that follows the same logic as the one I use. The basic data looks like this [link](https://imgur.com/a/dimX9k1), only it has ten feature columns and more than 500000 rows. After I use a query like this `g = df.groupby('CLASS') g.count().rsub(g.size(), axis=0)` I get the number of NaN values in the features of a class. It looks like this -> [link](https://imgur.com/0nytSHI). If I plot it with all the classes it is not readable, so I want to have a separate plot for "Ford Z" etc. – codlix Dec 28 '18 at 00:37
  • If you pass the given data https://imgur.com/0nytSHI to the above solution, you'll probably arrive at the right answer. – meW Dec 28 '18 at 00:40
2

Grouping is the way to go, just set the labels

for name, grp in df3.groupby('CLASS'):
    ax = grp.plot.bar()
    ax.set_xticks([])
    ax.set_xlabel(name)
Vaishali
  • Thanks for the solution, but I get the error: 'str' object has no attribute 'plot' in line 2 – codlix Dec 27 '18 at 23:58
  • I now understand what is happening here. The columns have the type string, so nothing can be plotted because the data is not numeric. That's why I received the error message. What I actually want to do is show the count of the features in relation to the class. It works for the whole DataFrameGroupBy object, but not for the individual groups – codlix Dec 28 '18 at 00:16
  • @FelTry2, you mean the data are object type? You can convert them using astype('float') first – Vaishali Dec 28 '18 at 00:49
0

With the solution provided by @meW I was able to achieve a result that is close to my goal.

I had to take two steps to actually use his solution.

  1. Cast the GroupBy object to a DataFrame via df = pd.DataFrame(df.groupby('Class').count().rsub(df.groupby('Class').size(), axis=0))
  2. The groupby query turned the Class column into the index, so I had to turn it back into a column via grouped['class'] = grouped.index
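The two steps above can be sketched roughly as follows, on hypothetical toy data. The column names and the use of `reset_index()` instead of assigning the index to a new column are assumptions for illustration. The result is a long-format table of the kind that `pd.melt` produces for the seaborn solution.

```python
import pandas as pd

# Hypothetical categorical data with missing values
df = pd.DataFrame({
    "CLASS": ["A", "A", "B", "B", "B"],
    "FEATURE1": ["x", None, "y", None, None],
    "FEATURE2": ["u", "v", None, "w", "z"],
})

g = df.groupby("CLASS")
# Step 1: group size minus non-null count = number of NaN values
counts = g.count().rsub(g.size(), axis=0)
# Step 2: move the class labels from the index back into a column
counts = counts.reset_index()

# Long format: one row per (CLASS, feature) pair
long_df = pd.melt(counts, id_vars="CLASS", var_name="feature", value_name="val")
print(long_df)
```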

Two questions arise from this solution. First: is it possible to fit the ticks to the different amounts of NaN? There are classes with only 5-10 NaN values in the features and classes with over 1000 NaN values (see pictures).

[Images: "Many NaN", "Less NaN"]

Second question: the feature names are only shown in the last plot. Is there a way to add them to the x-axis of every plot?
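One possible direction for both questions, offered as a sketch rather than a confirmed fix: `sns.catplot` accepts `sharex` and `sharey` arguments. Setting them to `False` should let each facet choose its own y-axis range and keep its own x tick labels. The long-format data below is hypothetical, and newer seaborn versions may emit a warning about colors when `sharex=False` is used.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical NaN counts per class, already in long format
long_df = pd.DataFrame({
    "CLASS": ["A", "A", "B", "B"],
    "feature": ["FEATURE1", "FEATURE2", "FEATURE1", "FEATURE2"],
    "val": [5, 10, 1000, 400],
})

g = sns.catplot(
    x="feature", y="val", col="CLASS", data=long_df,
    kind="bar", col_wrap=1,
    sharey=False,  # each facet gets its own y-axis range and ticks
    sharex=False,  # each facet keeps its own x tick labels
)

if __name__ == "__main__":
    plt.show()
```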

codlix