I have looked at the following questions to do the same in Databricks PySpark:
Iterating through a dataframe and plotting each column
Iterate through columns to generate barplots while using groupby
How can we loop through items in a dataframe and create a bar charts for each 'group' of items?
However, I have not been able to write something similar that works in PySpark.
My Spark DataFrame looks like this:
data = [{'id1': '1150', 'org':'south_org', 'id2':300, 'fields':'firstname', 'value':'complete'},
{'id1': '1150', 'org':'south_org', 'id2':300, 'fields':'lastname', 'value':'complete'},
{'id1': '1150', 'org':'south_org', 'id2':300, 'fields':'streetaddress', 'value':'missing'},
{'id1': '1150', 'org':'south_org', 'id2':300, 'fields':'state', 'value':'missing'},
{'id1': '1150', 'org':'south_org', 'id2':300, 'fields':'zip', 'value':'complete'},
{'id1': '3000', 'org':'north_org', 'id2':310, 'fields':'firstname', 'value':'complete'},
{'id1': '3000', 'org':'north_org', 'id2':310, 'fields':'lastname', 'value':'complete'},
{'id1': '3000', 'org':'north_org', 'id2':310, 'fields':'streetaddress', 'value':'error'},
{'id1': '3000', 'org':'north_org', 'id2':310, 'fields':'state', 'value':'complete'},
{'id1': '3000', 'org':'north_org', 'id2':310, 'fields':'zip', 'value':'complete'},
{'id1': '1110', 'org':'west_org', 'id2':315, 'fields':'firstname', 'value':'complete'},
{'id1': '1110', 'org':'west_org', 'id2':315, 'fields':'lastname', 'value':'complete'},
{'id1': '1110', 'org':'west_org', 'id2':315, 'fields':'streetaddress', 'value':'complete'},
{'id1': '1110', 'org':'west_org', 'id2':315, 'fields':'state', 'value':'complete'},
{'id1': '1110', 'org':'west_org', 'id2':315, 'fields':'zip', 'value':'complete'}
]
I have 130 million rows because the data is in long format.
I would like to create a separate bar graph for each id1 value. My approach is to iterate through the distinct values of that column and create one bar graph per value.
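To make the intended iteration concrete, here is a minimal sketch on a tiny in-memory pandas sample, not on the full Spark DataFrame. It assumes pandas and matplotlib are available. The names `sample`, `status_colors` and `figures`, the colours, and the choice of one equal-length horizontal bar per field are my own placeholders for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Tiny sample in the same long format as my data (two id1 values only).
sample = pd.DataFrame([
    {"id1": "1150", "fields": "firstname", "value": "complete"},
    {"id1": "1150", "fields": "streetaddress", "value": "missing"},
    {"id1": "1150", "fields": "zip", "value": "complete"},
    {"id1": "3000", "fields": "firstname", "value": "complete"},
    {"id1": "3000", "fields": "streetaddress", "value": "error"},
    {"id1": "3000", "fields": "zip", "value": "complete"},
])

# Hypothetical colour per status; the colours are an arbitrary choice.
status_colors = {"complete": "tab:green", "missing": "tab:orange", "error": "tab:red"}

figures = {}
for id1, group in sample.groupby("id1"):
    fig, ax = plt.subplots(figsize=(6, 3))
    # One horizontal bar per field (fields on the y-axis), coloured by its status.
    colors = group["value"].map(status_colors).tolist()
    ax.barh(group["fields"], [1] * len(group), color=colors)
    ax.set_title(f"id1 = {id1}")
    ax.set_xticks([])  # bar length carries no meaning here, only the colour does
    fig.tight_layout()
    figures[id1] = fig
```

This works for a sample that fits in memory, but it is exactly the pandas step that I cannot apply to the full 130 million rows.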
I tried the following approaches. When I first tried converting the whole DataFrame to a pandas DataFrame, I got an OOM error.
I also tried adapting the code suggested in the questions above, and some of it raised errors:
grouped = data.groupby('id1')
for i, (groupname, group) in enumerate(grouped):
    axes[i].plot(group.id1, group.values, kind='stacked')
    axes[i].set_title(groupname)
This next one also did not work:
import matplotlib.pyplot as plt
import seaborn as sb
for id1, data in wide_df:
    plt.figure(figsize=(20,5))
    sns.barplot(data=data, x='fields',y='status')
    plt.tight_layout()
    plt.show()
What I am trying to achieve, for each id1 (I have 1200 unique id1 values), is an individual graph like the one below, where the categories on the y-axis are the values of the 'fields' column of the Spark DataFrame.