
I am trying to write JSON content to an HDFS location using Python, but every key and value in the written file is prefixed with u and wrapped in single quotes.

Original JSON content:

{ "id": 2344556, "resource_type": "user", "ext_uid": null, "email": "Richard.John@abc.com", "name": "Rich John", "role": "manager", "role_id": 5944 }

import os
import json
import requests
from requests.auth import HTTPBasicAuth

users_url = "https://api.test.com/api/k1/users"
response = requests.get(users_url, auth=authParams)
users_json = response.json()
os.system('echo "%s" | hadoop fs -put - hdfs://hadoopnode/user/uui123/API_JSON/user.json' % (users_json))

Output written to the HDFS location:

{ u'id': u'2344556', u'resource_type': u'user', u'ext_uid': u'null', u'email': u'Richard.John@abc.com', u'name': u'Rich John', u'role': u'manager', u'role_id': u'5944' }

How can I write my original JSON content to the HDFS file without the u prefixes and single quotes?

Rahul
  • You need to `json.dumps` the dictionary into a string, but you really shouldn't be using HDFS for such a small file – OneCricketeer Dec 16 '20 at 00:03
  • Also, if you were to use Spark, then you shouldn't be using shell commands to write to hdfs – OneCricketeer Dec 16 '20 at 00:04
  • @OneCricketeer, it is a big file and I want to write it to HDFS before I process it with a dataframe. – Rahul Dec 16 '20 at 03:34
  • @OneCricketeer, I tried json.dumps and it's writing an empty file – Rahul Dec 16 '20 at 03:47
  • Well, I'm not sure what you changed, but anything less than 128MB is not big for HDFS, and any http response that large will likely timeout with Python requests module defaults – OneCricketeer Dec 16 '20 at 05:39
  • @OneCricketeer, here is the actual problem: when I read the JSON response directly into Spark I get a single _corrupt_record column, but when I write the JSON response to a file and read that file with Spark, it works fine. https://stackoverflow.com/questions/35409539/corrupt-record-error-when-reading-a-json-file-into-spark – Rahul Dec 16 '20 at 06:49
  • If that's the actual problem, then edit your question to show that. The issue would be that the url is returning you more than a single object or it has indentation, and Spark expects jsonlines (one json object per line) format, not a single object spread over multiple lines like an HTTP response would return – OneCricketeer Dec 16 '20 at 13:40
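Building on the comment suggestion, the u'...' prefixes appear because interpolating the dict into the shell string prints Python's repr of the dict, not JSON. A minimal sketch of the fix: serialize with `json.dumps` and feed the result to `hadoop fs -put` over stdin via `subprocess`, which also avoids the shell-quoting problems of `echo "%s"`. The `write_to_hdfs` helper and the HDFS path argument here are illustrative, not from the original post.

```python
import json
import subprocess

def serialize_users(users_json):
    # json.dumps produces valid JSON text ("id": 2344556, null stays null).
    # Interpolating the dict into a shell string instead emits its Python
    # repr, i.e. {u'id': u'2344556', ...}, which is what the question shows.
    return json.dumps(users_json)

def write_to_hdfs(payload, hdfs_path):
    # Hypothetical helper: pipe the JSON text to `hadoop fs -put -` on
    # stdin rather than building an `echo "..."` shell command, so quotes
    # and special characters in the payload cannot break the command.
    subprocess.run(
        ["hadoop", "fs", "-put", "-", hdfs_path],
        input=payload.encode("utf-8"),
        check=True,
    )
```

Usage would then be `write_to_hdfs(serialize_users(response.json()), "hdfs://hadoopnode/user/uui123/API_JSON/user.json")`. Note that if the file is later read with Spark, the payload may additionally need to be reshaped into jsonlines (one object per line), as the last comment points out.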

0 Answers