I am having trouble converting a .csv file to a multiline JSON file using PySpark.

I read the CSV file with Spark and need to convert its contents to multiline JSON.

Here is my code:

import json

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jsonconversion").getOrCreate()

df = spark.read.format("csv").option("header", "True").load(csv_file)
df.show()
df_json = df.toJSON()

for row in df_json.collect():
    line = json.loads(row)
    result = []
    for key, value in list(line.items()):
        if key == 'FieldName':
            FieldName = line['FieldName']
            del line['FieldName']
            result.append({FieldName: line})
            res = result
            with open("D:/tasklist/jsaonoutput.json", 'a+') as f:
                f.write(json.dumps(res, indent=4, separators=(',', ':')))

I need the output in the format below.

{
    "Name": {
        "DataType": "String",
        "Length": 4,
        "Required": "Y",
        "Output": "Y",
        "Address": "N",
        "Phone Number": "N",
        "DoorNumber": "N/A",
        "Street": "N",
        "Locality": "N/A",
        "State": "N/A"
    }
}

My input CSV file looks like this:

[Image: screenshot of the input CSV file, not reproduced here]

I am new to PySpark. Any pointers on turning this into working code would be much appreciated.

Thank you in advance.

khadar

1 Answer


Try the following code. It first creates a pandas DataFrame from the Spark DataFrame (unless you need the Spark DataFrame for something else, you can load the CSV file directly into pandas). It then groups the pandas DataFrame by the FieldName column and writes the result to a file, with json.dumps taking care of the formatting.

import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jsonconversion").getOrCreate()
df = spark.read.format("csv").option("header", "True").load(csv_file)
df.show()

# Convert to pandas and group the rows by the FieldName column
df_pandas_grped = df.toPandas().groupby('FieldName')
final_dict = {}
for key, grp in df_pandas_grped:
    # Each group becomes a list of row dicts (one dict per CSV row)
    final_dict[str(key)] = grp.to_dict('records')

# Write the whole dictionary once as indented JSON
with open("D:/tasklist/jsaonoutput.json", 'w') as f:
    f.write(json.dumps(final_dict, indent=4))
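
Note that grouping with to_dict('records') maps each FieldName to a list of row dicts, and each row dict still contains the FieldName column. If the flatter shape from the question is wanted (one object per field name), the records could be reshaped in plain Python. The following is only a sketch: the helper name `nest_by_field` and the sample rows are hypothetical, and it assumes each FieldName appears once in the CSV.

```python
import json


def nest_by_field(records, key="FieldName"):
    """Map each record's `key` value to a dict of its remaining columns."""
    nested = {}
    for record in records:
        rest = dict(record)        # copy so the caller's dict is left untouched
        name = rest.pop(key)       # take the key column out of the payload
        nested[str(name)] = rest   # a repeated name overwrites the earlier row
    return nested


# Hypothetical rows, shaped like the output of to_dict('records')
sample_records = [
    {"FieldName": "Name", "DataType": "String", "Length": "4", "Required": "Y"},
    {"FieldName": "City", "DataType": "String", "Length": "20", "Required": "N"},
]

nested = nest_by_field(sample_records)
print(json.dumps(nested, indent=4))
```

The records could come from `df.toPandas().to_dict('records')`, or, without pandas, from `[row.asDict() for row in df.collect()]`. Since the CSV is read without schema inference, values such as Length would likely arrive as strings ("4") rather than numbers.
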
Manoj Singh
  • In `final_dict[str(key)] = grp.to_dict('records')`, can you please tell me what 'records' refers to? – khadar Dec 09 '18 at 10:37
  • It's the `orient` parameter of [pandas.DataFrame.to_dict(orient='dict', into=)](https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.DataFrame.to_dict.html). You can also find more details at [Convert a Pandas DataFrame to a dictionary](https://stackoverflow.com/questions/26716616/convert-a-pandas-dataframe-to-a-dictionary) – Manoj Singh Dec 09 '18 at 14:14
  • Thank you for the information. I face another issue when executing the above script: it shows a pandas import error, and I am not able to install pandas. Could you please help me with this? – khadar Dec 11 '18 at 03:22
  • `pip install pandas` should do the trick, depending on the environment you are using. You can check out other posts on similar install issues: [install pandas](https://stackoverflow.com/search?q=install+pandas) – Manoj Singh Dec 11 '18 at 03:31
  • @khadar, do you want to read the CSV into a DataFrame, convert it into the given JSON format, and write it to a file, or something else? – vikrant rana Dec 20 '18 at 13:58
  • @khadar, I checked your code and it seems fine. What problem are you facing? – vikrant rana Dec 26 '18 at 10:30
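
Regarding the pandas import error raised in the comments: if the file is small enough to handle on a single machine, one option is to avoid pandas altogether and use only the standard library. This is a minimal sketch under that assumption, not a replacement for the Spark approach; the function name `csv_to_nested_json` and the sample CSV content are hypothetical.

```python
import csv
import io
import json


def csv_to_nested_json(stream, key="FieldName"):
    """Read CSV text from `stream` and return a JSON string keyed by `key`."""
    reader = csv.DictReader(stream)  # the first line is taken as the header
    nested = {}
    for row in reader:
        name = row.pop(key)          # use the key column as the outer key
        nested[name] = row           # remaining columns become the inner object
    return json.dumps(nested, indent=4)


# Hypothetical CSV content; the real column names come from the input file
sample_csv = io.StringIO(
    "FieldName,DataType,Length,Required\n"
    "Name,String,4,Y\n"
)

output = csv_to_nested_json(sample_csv)
print(output)
```

For a real file, the handle from `open(path, newline="")` can be passed in place of the `io.StringIO` object, and the returned string written out with a single `f.write(...)` call in 'w' mode.
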