1

Screenshot of the data enter image description here

I'm trying to clean some data to answer questions on it. I appended a sum row and whilst most of the sum of the column add up to 1, if I want to find the top 10 or 20 common illnesses, how do I 'filter' by the 'Total' which is a row? For instance, I would like to have all the columns (I think index 5 to 135) where the 'Total' ( which is the last row in the dataframe) is > 20 and then save those for the main analysis.

Thank you

Shubham Sharma
  • 68,127
  • 6
  • 24
  • 53

1 Answers1

2

We can use indexing with .loc

df.loc[:, df.loc['Total'] > 20]
Shubham Sharma
  • 68,127
  • 6
  • 24
  • 53
  • I get this error: '>' not supported between instances of 'str' and 'int' – Fernandes92 May 28 '22 at 17:51
  • 1
    @LauraFernandes probably you have mixed datatypes in `Total` row. Check: `df.loc[:, pd.to_numeric(df.loc['Total'], errors='coerce') > 20]` – Shubham Sharma May 28 '22 at 18:01
  • OMG THANK YOU SOOO MUCH!! I spent hours on it :') . Your above code worked!! Can I bug you on one last thing - how to make panda plots with this 'Total' given it has mixed datatypes? For example if I try new_df.loc['Total'].plot() it says no numeric data to plot – Fernandes92 May 28 '22 at 19:13
  • 1
    Glad to help! You have to change the dtype to numeric. `pd.to_numeric(df.loc['Total'], errors='coerce').plot()` – Shubham Sharma May 28 '22 at 19:23
  • Hey, will I always need to use pd.to_numeric? What if I want to just plot using matplotlib and I want to plot Total vs Molars for example, how would that work :( Thanks – Fernandes92 May 29 '22 at 09:57
  • @LauraFernandes In that case convert the row to numeric type then store the row into some variable, and plot using matplotlib. – Shubham Sharma May 29 '22 at 10:47
  • Totals = pd.to_numeric(new_df.loc['Total'],errors = 'coerce') long_claws = new_df['Claw/nails overlong'] plt.scatter(Totals, long_claws). <-- I tried this but I keep getting errors that x and y must be the same size :(( – Fernandes92 May 29 '22 at 11:22
  • @LauraFernandes Its very difficult to tell what you are trying to do here, maybe if you can provide a little bit more context, I might be able to help. – Shubham Sharma May 29 '22 at 11:25
  • I would like to make several plots where it shows total vs the breed, and/or most common disorders again using the total. – Fernandes92 May 29 '22 at 13:18
  • 1
    Laura, I guess you are approaching this problem the hard way, there should be another simpler way by which you can achieve the same thing. TBH, I don't see the relevance of calculating the total row in the first place. – Shubham Sharma May 29 '22 at 13:42
  • Oh okay, it's just because there are 135 columns so I thought creating/appending a row of totals would help investigate the top 10 or top 20 common disorders and then cross referencing them against breeds or male/female and creating plots accordingly. – Fernandes92 May 29 '22 at 14:07
  • 1
    I would suggest to first identify the disease column names, then do a groupby on `Breeds` and sum the values in disease columns this will help you in identifying the count of particular disease per breeed. After this you can easily plot the data using matplotlib or seaborn. – Shubham Sharma May 29 '22 at 14:15
  • can you help? https://stackoverflow.com/q/72425505/14127922?stw=2 – Fernandes92 May 29 '22 at 17:59
  • 1
    Hey Laura! I see that you already got the answer. – Shubham Sharma May 30 '22 at 12:52
  • kind-off but could do with more help :D – Fernandes92 May 30 '22 at 17:28
  • let me check the question – Shubham Sharma May 30 '22 at 17:47
  • thanks it was related to suggestion above. – Fernandes92 May 30 '22 at 17:57