
I managed to download datasets from Kaggle using the Kaggle API, and the data was stored under the /databricks/driver directory.

    %sh pip install kaggle

    %sh
    export KAGGLE_USERNAME=my_name
    export KAGGLE_KEY=my_key
    kaggle competitions download -c ncaaw-march-mania-2021

    %sh unzip ncaaw-march-mania-2021.zip
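
(Assuming the archive unpacks into WDataFiles_Stage1, as the read attempt below suggests, a quick Python check of what landed on the driver's local disk:)

    import os

    # the unzip above runs on the driver, so the files are on its local disk
    os.listdir("/databricks/driver/WDataFiles_Stage1")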

The problem is: how can I use these files from DBFS? This is how I tried to read the CSV files with PySpark, and the error I got:

    spark.read.csv('/databricks/driver/WDataFiles_Stage1/Cities.csv')

    AnalysisException: Path does not exist: dbfs:/databricks/driver/WDataFiles_Stage1/Cities.csv
Memphis Meng

2 Answers


spark.read... works with DBFS paths by default, so you have two choices:

  • use file:/databricks/driver/... to force reading from the local file system. This works on Community Edition because it's a single-node cluster; it won't work on a distributed cluster (a short sketch follows the code block below)

  • copy files to DBFS using the dbutils.fs.cp command (docs) and read from DBFS:

    dbutils.fs.cp("file:/databricks/driver/WDataFiles_Stage1/Cities.csv",
                  "/FileStore/Cities.csv")
    df = spark.read.csv("/FileStore/Cities.csv")
    ....
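
For the first option, a minimal sketch using the same file (the explicit file:/ scheme tells Spark to read from the driver's local disk instead of DBFS):

    # single-node (e.g. Community Edition) clusters only:
    # the file exists on the driver's local disk, not on DBFS
    df = spark.read.csv("file:/databricks/driver/WDataFiles_Stage1/Cities.csv")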
Alex Ott

I have really struggled with the Kaggle API, so I use opendatasets instead. I installed the library on the cluster.

    import opendatasets as od

    # downloads the competition zip into the given directory
    # (the /dbfs/ prefix is the FUSE mount, so this lands in DBFS)
    od.download(
        "https://www.kaggle.com/competitions/tlvmc-parkinsons-freezing-gait-prediction/data",
        "/dbfs/FileStore/mypath/")

The output, when you run this, first shows the zip being downloaded. Once the download is complete, the files are automatically extracted:

    Extracting archive /dbfs/FileStore/mypath/tlvmc-parkinsons-freezing-gait-prediction/tlvmc-parkinsons-freezing-gait-prediction.zip to /dbfs/FileStore/mypath/tlvmc-parkinsons-freezing-gait-prediction

documentation
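
Once extracted, the files sit under DBFS, so Spark can read them directly. A sketch, where the file name is a hypothetical placeholder (list the directory first to see what's actually there):

    # list the extracted files (path taken from the output above)
    display(dbutils.fs.ls("/FileStore/mypath/tlvmc-parkinsons-freezing-gait-prediction"))

    # "train.csv" is a placeholder; substitute one of the listed files
    df = spark.read.csv(
        "/FileStore/mypath/tlvmc-parkinsons-freezing-gait-prediction/train.csv",
        header=True)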

Climbs_lika_Spyder