I would like to create a Spark DataFrame from a CSV file that I retrieve as an HTTP stream:
    import requests

    # host, filepath, username and password are defined elsewhere
    url = "https://{0}:8443/gateway/default/webhdfs/v1/{1}?op=OPEN".format(host, filepath)

    r = requests.get(url,
                     auth=(username, password),
                     verify=False,
                     allow_redirects=True,
                     stream=True)

    chunk_size = 1024
    for chunk in r.iter_content(chunk_size):
        # how to load the data?
        pass
How can the data be loaded into Spark from the HTTP stream?
Note that reading the data directly from HDFS is not an option here; it must be retrieved through WebHDFS.
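
One possible direction, offered as a minimal sketch rather than a definitive answer: the byte chunks from `iter_content` can be reassembled into complete text lines, which are then handed to Spark. The helper names `iter_lines_from_chunks` and `load_csv_lines` are hypothetical, as is the `spark` session object. The sketch assumes the file is UTF-8, contains no quoted fields spanning several lines, has a header row, and is small enough to be collected on the driver.

```python
import codecs


def iter_lines_from_chunks(chunks, encoding="utf-8"):
    """Reassemble an iterable of byte chunks into complete text lines.

    A chunk boundary can fall in the middle of a line or of a multi-byte
    character, so an incremental decoder and a carry-over buffer are used.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    pending = ""
    for chunk in chunks:
        pending += decoder.decode(chunk)
        lines = pending.split("\n")
        pending = lines.pop()  # the last piece may be an incomplete line
        for line in lines:
            yield line.rstrip("\r")
    # Flush whatever the decoder still holds, then the final unterminated line.
    pending += decoder.decode(b"", final=True)
    if pending:
        yield pending.rstrip("\r")


def load_csv_lines(spark, lines):
    """Hypothetical helper: build a DataFrame from CSV lines held on the driver.

    Assumes `spark` is an existing SparkSession and that the first line is a
    header. All lines are materialized in driver memory before distribution.
    """
    rdd = spark.sparkContext.parallelize(list(lines))
    return spark.read.csv(rdd, header=True)
```

With the request from the question, this might be used as `df = load_csv_lines(spark, iter_lines_from_chunks(r.iter_content(chunk_size)))`. For files too large for the driver, writing the chunks to storage that the executors can read and pointing `spark.read.csv` at that path would probably be a better fit.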