
I am trying to load a dataset from a URL through pandas:

import pandas as pd

df1 = pd.read_csv('https://data.cityofnewyork.us/api/views/m6nq-qud6/rows.csv?accessType=DOWNLOAD&bom=true&format=true', low_memory=False)

Then I am trying to convert it into a Spark DataFrame:

df=spark.createDataFrame(df1.astype(str)) 
df.printSchema()
df.show()

but I am getting a file-not-found error:

FileNotFoundError: [Errno 2] No such file or directory: 'C:\Users\hp\AppData\Local\Temp\spark-69fc2b6c-2af6-429c-a51e-e639f3430e37\pyspark-954bf663-29ea-4ca5-b79b-86dfe34cbb4c\tmpeome_gmf'

I don't understand what is going wrong here. I thought it was because of NaN values, so I tried using dropna and then converting to Spark, but it still won't work. If there is another way to load a dataset directly from a URL, please let me know. URL: https://catalog.data.gov/dataset/2021-yellow-taxi-trip-data-jan-jul

Atpug627

1 Answer

import urllib

link = "http://www.somesite.com/details.pl?urn=2344"
f = urllib.urlopen(link)
myfile = f.read()
print(myfile)

EDIT (2018-06-25): In Python 3, the legacy urllib.urlopen() has been replaced by urllib.request.urlopen() (see the notes at https://docs.python.org/3/library/urllib.request.html#urllib.request.urlopen for details).
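
For reference, a minimal Python 3 equivalent of the snippet above (an illustrative sketch, not part of the original answer) looks like this:

from urllib.request import urlopen

link = "http://www.somesite.com/details.pl?urn=2344"
with urlopen(link) as f:            # urllib.request.urlopen in Python 3
    myfile = f.read().decode()      # read() returns bytes, so decode to str
print(myfile)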

If you're using Python 3, see answers by Martin Thoma or i.n.n.m within this question: https://stackoverflow.com/a/28040508/158111 (Python 2/3 compat) https://stackoverflow.com/a/45886824/158111 (Python 3)

Or, just get this library here: http://docs.python-requests.org/en/latest/ and seriously use it :)

import requests

link = "http://www.somesite.com/details.pl?urn=2344"
f = requests.get(link)
print(f.text)
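
Since the question already loads the CSV with pandas, the downloaded text can also be handed straight to pd.read_csv without writing anything to disk. This is only a sketch built from the snippets above, wrapping the response text in io.StringIO and reusing the URL and low_memory option from the question:

import io
import pandas as pd
import requests

link = "https://data.cityofnewyork.us/api/views/m6nq-qud6/rows.csv?accessType=DOWNLOAD&bom=true&format=true"
r = requests.get(link)
r.raise_for_status()                                   # stop early on HTTP errors
df1 = pd.read_csv(io.StringIO(r.text), low_memory=False)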

Use the code below; hope it helps.

from pyspark.sql import SparkSession
import requests

link = "https://data.cityofnewyork.us/api/views/m6nq-qud6/rows.csv?accessType=DOWNLOAD&bom=true&format=true"
r = requests.get(link, allow_redirects=True)
with open(r'C:\Users\Saurabh\Desktop\file.csv', 'wb') as f:
    f.write(r.content)

spark = SparkSession.builder.master("local").appName("records").getOrCreate()

df = spark.read.format("csv").option("header", "true")\
    .option("inferSchema", "true").load(r"C:\Users\Saurabh\Desktop\file.csv")

Please use the updated code below, as requested in the comments. You can generalize the path using the pathlib library; it is part of the Python standard library, so a plain import is enough (no pip install needed).

Note that I have used Path(__file__).parent so the file is stored, by default, in the directory where the .py script itself lives. Whatever system the source (.py) file is saved on, the CSV is first downloaded next to it, and that same file is then loaded into your DataFrame (df). Hope this helps :)

from pyspark.sql import SparkSession
import requests
import pathlib

fn = pathlib.Path(__file__).parent / 'file.csv'    # CSV lands next to this .py file
link = "https://data.cityofnewyork.us/api/views/m6nq-qud6/rows.csv?accessType=DOWNLOAD&bom=true&format=true"
r = requests.get(link, allow_redirects=True)
with open(fn, 'wb') as f:                          # write to fn, not the bare 'file.csv'
    f.write(r.content)

spark = SparkSession.builder.master("local").appName("records").getOrCreate()

df = spark.read.format("csv").option("header", "true")\
    .option("inferSchema", "true").load(str(fn))

Saurabh
  • This is working fine for me, but the problem is that I want the Spark code to work on anyone's machine, not only mine, so if I give a path from my device it will only work for me. – Atpug627 Mar 23 '22 at 11:24
  • Added the updated code. Please try it out and let me know if it helps. – Saurabh Mar 27 '22 at 17:28