I am a new coder, and for a class assignment we have to make an API call to an outside dataset and then plot something interesting about the data. I made my API call to a NYC tree census dataset. The data includes both the tree species and a health status (Good, Fair, Poor, Dead). I want to make a stacked bar plot showing the percentage of each health status for each tree species. For example, I want one bar for Maple trees, showing that 56% are good, 26% are fair, 13% are poor, and 5% are dead. I'm not really sure how to accomplish all of this. Here is a screenshot showing how my dataset looks. Thanks for any advice!
It is not recommended that data be presented as images. It can be toy data and should be presented in text. It is also desirable to present the code that you are working on. This will reduce the burden on the respondent and make it easier to answer. – r-beginners Jun 19 '21 at 07:53
For this kind of data, it is necessary to determine how many different types of trees there are and focus on the top trees to visualize. Once the tree types are narrowed down, we can calculate the composition ratio of them by health attributes and graph them. – r-beginners Jun 19 '21 at 07:57
1 Answer
- I used Kaggle as the data source. I also found an API for this data, but did not use it because it was very slow for me.
- The data I used has no dead trees; the health status is only Poor, Fair, or Good.
- I used the pandas percentage-of-total-with-groupby technique to calculate the percentages.
- I prefer plotly to matplotlib for plotting, but both are simple to use.
- There really are too many bars for this to be a high-quality visualisation.
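The percentage-of-total step mentioned in the list above can be sketched on toy data. This is a minimal illustration, not the exact code used further down: the rows are made up, and only the column names (`tree_id`, `spc_common`, `health`) are taken from the data set.

```python
import pandas as pd

# Toy data, made up for illustration: one row per tree
toy = pd.DataFrame(
    {
        "tree_id": range(1, 11),
        "spc_common": ["maple"] * 4 + ["oak"] * 6,
        "health": ["Good", "Good", "Fair", "Poor"]
        + ["Good", "Good", "Good", "Fair", "Fair", "Poor"],
    }
)

# Step 1: count trees per (species, health) pair
counts = toy.groupby(["spc_common", "health"])["tree_id"].count()

# Step 2: divide each count by the total for its species
shares = counts / counts.groupby(level="spc_common").transform("sum")

# Step 3: pivot health into columns, one row per species
wide = shares.unstack("health").fillna(0)
print(wide)
```

Each row of `wide` sums to 1, which is what a stacked percentage bar needs.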
get data from API (kaggle)
import kaggle.cli
import sys
import pandas as pd
from zipfile import ZipFile
# search for the data set (uncomment to run)
# sys.argv = [sys.argv[0]] + "datasets list -s \"2015-street-tree-census-tree-data.csv\"".split(" ")
# kaggle.cli.main()
# download the data set (requires Kaggle API credentials to be configured)
sys.argv = [sys.argv[0]] + "datasets download new-york-city/ny-2015-street-tree-census-tree-data".split(" ")
kaggle.cli.main()
zfile = ZipFile("ny-2015-street-tree-census-tree-data.zip")
# list the archive contents (displayed when run in a notebook)
zfile.infolist()
# read the first file in the archive, which is the CSV
df = pd.read_csv(zfile.open(zfile.infolist()[0]))
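Reading the CSV straight out of the zip archive, as in the last line above, can be tried without downloading anything. The sketch below builds a tiny archive in memory; the file name `toy_trees.csv` and its two rows are made up for illustration.

```python
import io
import zipfile

import pandas as pd

# Build a tiny zip archive in memory so the example is self-contained
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr(
        "toy_trees.csv",  # hypothetical file name
        "tree_id,spc_common,health\n1,maple,Good\n2,oak,Fair\n",
    )

# Re-open the archive for reading and load the first CSV without extracting it
buffer.seek(0)
with zipfile.ZipFile(buffer) as zf:
    csv_names = [name for name in zf.namelist() if name.endswith(".csv")]
    with zf.open(csv_names[0]) as fh:
        toy_df = pd.read_csv(fh)

print(toy_df)
```

Picking the member by its `.csv` suffix rather than by position is slightly more robust if an archive also contains other files.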
prepare data and plot using plotly
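import plotly.express as px

spc = "spc_common"

# aggregate the data and shape it for plotting
dfa = (
    df.groupby([spc, "health"])
    .agg({"tree_id": "count"})  # number of trees per (species, health)
    .groupby(level=spc)
    # transform keeps the (species, health) index; with pandas 2.x,
    # apply would add a second species level and break reset_index()
    .transform(lambda x: x / x.sum())  # share within each species
    .unstack("health")  # one column per health status
    .droplevel(0, axis=1)  # drop the "tree_id" column level
)

fig = px.bar(
    dfa.reset_index(),
    x=spc,
    y=["Poor", "Fair", "Good"],
    color_discrete_sequence=["red", "blue", "green"],
)
# show the y axis as percentages
fig.update_layout(yaxis={"tickformat": "%"})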
matplotlib
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(14, 3))
dfa.plot(kind="bar", stacked=True, ax=ax)
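Since there are too many species for a readable chart, one option is to keep only the most common ones before computing the shares. The sketch below is an alternative to the groupby chain above, not the code used for the plots: it swaps in `pd.crosstab` with `normalize="index"`, and the helper name `health_shares_top_n` and the toy rows are made up for illustration.

```python
import pandas as pd


def health_shares_top_n(df, n=10, species_col="spc_common", health_col="health"):
    """Return a species x health table of row shares for the n most common species."""
    # value_counts sorts by frequency, so the first n labels are the most common
    top = df[species_col].value_counts().head(n).index
    subset = df[df[species_col].isin(top)]
    # normalize="index" makes each row (species) sum to 1
    table = pd.crosstab(subset[species_col], subset[health_col], normalize="index")
    # order rows from most to least common species
    return table.reindex(top)


# Toy data, made up for illustration
toy = pd.DataFrame(
    {
        "spc_common": ["maple"] * 5 + ["oak"] * 3 + ["elm"],
        "health": ["Good", "Good", "Good", "Fair", "Poor"]
        + ["Good", "Fair", "Fair"]
        + ["Good"],
    }
)

result = health_shares_top_n(toy, n=2)
print(result)
```

The returned table has the same shape as `dfa`, so it could be passed to the same `plot(kind="bar", stacked=True)` call.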

Rob Raymond