
I have referenced the following posts to do the same thing in Databricks PySpark:

Iterating through a dataframe and plotting each column

Iterate through columns to generate barplots while using groupby

How can we loop through items in a dataframe and create a bar charts for each 'group' of items?

but I am not able to write something similar that works in PySpark.

My spark dataframe looks like this:

data = [{'id1': '1150', 'org':'south_org', 'id2':300, 'fields':'firstname', 'value':'complete'},
        {'id1': '1150', 'org':'south_org', 'id2':300, 'fields':'lastname', 'value':'complete'},
        {'id1': '1150', 'org':'south_org', 'id2':300, 'fields':'streetaddress', 'value':'missing'},
        {'id1': '1150', 'org':'south_org', 'id2':300, 'fields':'state', 'value':'missing'},
        {'id1': '1150', 'org':'south_org', 'id2':300, 'fields':'zip', 'value':'complete'},
        {'id1': '3000', 'org':'north_org', 'id2':310, 'fields':'firstname', 'value':'complete'},
        {'id1': '3000', 'org':'north_org', 'id2':310, 'fields':'lastname', 'value':'complete'},
        {'id1': '3000', 'org':'north_org', 'id2':310, 'fields':'streetaddress', 'value':'error'},
        {'id1': '3000', 'org':'north_org', 'id2':310, 'fields':'state', 'value':'complete'},
        {'id1': '3000', 'org':'north_org', 'id2':310, 'fields':'zip', 'value':'complete'},
        {'id1': '1110', 'org':'west_org', 'id2':315, 'fields':'firstname', 'value':'complete'},
        {'id1': '1110', 'org':'west_org', 'id2':315, 'fields':'lastname', 'value':'complete'},
        {'id1': '1110', 'org':'west_org', 'id2':315, 'fields':'streetaddress', 'value':'complete'},
        {'id1': '1110', 'org':'west_org', 'id2':315, 'fields':'state', 'value':'complete'},
        {'id1': '1110', 'org':'west_org', 'id2':315, 'fields':'zip', 'value':'complete'}
       ]

I have 130 million rows because the data is in long format.

I would like to create a separate bar graph for each id1 value. My approach is to iterate through that column and create a bar graph per group.

I tried the following approaches. When I first tried converting the whole dataframe to a pandas df, I got an OOM error.
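One thing that might side-step the OOM (just a sketch, assuming `data` here refers to the Spark dataframe built from the rows above, not the Python list): aggregate down to counts in Spark first, so only a small summary table ever reaches the driver.

# aggregate in Spark: 130M long-format rows collapse to roughly
# (unique id1s) x (fields) x (distinct values) rows
summary = data.groupBy('id1', 'fields', 'value').count()

# small enough to bring to the driver as pandas
summary_pdf = summary.toPandas()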

I also followed the suggested code, and in some cases I got errors:

grouped = data.groupby('id1')

for i, (groupname, group) in enumerate(grouped):
  axes[i].plot(group.id1, group.values, kind='stacked')
  axes[i].set_title(groupname)
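For comparison, a pattern close to this first attempt does run on a plain pandas frame of counts (a sketch that assumes the summary_pdf from the snippet above): pivot fields against value and let pandas draw stacked horizontal bars.

import matplotlib.pyplot as plt

for id1_value, group in summary_pdf.groupby('id1'):
    # one row per field, one column per value ('complete', 'missing', 'error')
    counts = group.pivot(index='fields', columns='value', values='count').fillna(0)
    counts.plot(kind='barh', stacked=True, figsize=(10, 4), title=f'id1 = {id1_value}')
    plt.tight_layout()
    plt.show()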

This next one also did not work:

import matplotlib.pyplot as plt
import seaborn as sns

for id1, data in wide_df:
  plt.figure(figsize=(20,5))
  sns.barplot(data=data, x='fields',y='status')
  plt.tight_layout()
  plt.show()
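The seaborn version of the same idea would look roughly like this (again only a sketch against the aggregated summary_pdf, plotting the counts rather than the raw 'value' strings):

import matplotlib.pyplot as plt
import seaborn as sns

for id1_value, group in summary_pdf.groupby('id1'):
    plt.figure(figsize=(20, 5))
    # hue splits each field's bar by its value (complete / missing / error)
    sns.barplot(data=group, x='count', y='fields', hue='value')
    plt.title(f'id1 = {id1_value}')
    plt.tight_layout()
    plt.show()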

What I am trying to achieve is this: for each id1 (I have 1,200 unique id1s), create an individual graph like the one below, where the categories on the y-axis are the 'fields' column from the Spark dataframe.

expected output

1 Answer


Partial Answer, I hope:

I think I am onto something here. I was able to create an individual dataframe per id1, but I am still working out how to create a bar graph from each dataframe. Hopefully someone can guide me on how to add them.

for i in data.select('id1').distinct().collect():
  print(i)
  identification1 = i.id1
  per_id = data.filter(data.id1 == identification1)
  per_df = per_id.groupBy('fields', 'value').count().sort('fields')
  per_df.show()

resulting in something like:

Row(id1 = 1150)
+-----------+---------+--------+
|     fields|    value|   count|
+-----------+---------+--------+
|  firstname| complete|      10|
|  firstname|    error|       8|
|   lastname|  missing|      25|
|   lastname|    error|       2|
|        zip| complete|      35|
...
|otherfields| complete|      50|
+-----------+---------+--------+

I still need to figure out how to create a bar graph from each of these per_df results.
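A minimal sketch of that missing piece (my assumption about how to finish it; note it fires one Spark job per id1, so 1,200 jobs): per_df is tiny for a single id1, so it can be converted to pandas and plotted inside the same loop.

import matplotlib.pyplot as plt
import seaborn as sns

for i in data.select('id1').distinct().collect():
    per_id = data.filter(data.id1 == i.id1)
    per_df = per_id.groupBy('fields', 'value').count().sort('fields')

    pdf = per_df.toPandas()  # small: one row per (fields, value) pair for this id1

    plt.figure(figsize=(10, 4))
    sns.barplot(data=pdf, x='count', y='fields', hue='value')
    plt.title(f'id1 = {i.id1}')
    plt.tight_layout()
    plt.show()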
