
I managed to download datasets from Kaggle using the Kaggle API, and the data was stored under the /databricks/driver directory.

    %sh pip install kaggle

    %sh
    export KAGGLE_USERNAME=my_name
    export KAGGLE_KEY=my_key
    kaggle competitions download -c ncaaw-march-mania-2021

    %sh unzip ncaaw-march-mania-2021.zip
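
(Assuming the archive unpacks into WDataFiles_Stage1, as the read attempt below suggests, a quick Python check of what landed on the driver's local disk:)

    import os

    # the unzip above runs on the driver, so the files are on its local disk
    os.listdir("/databricks/driver/WDataFiles_Stage1")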

The problem is: how can I use these files from DBFS? This is how I tried to read the CSV files with PySpark, and the error I got:

    spark.read.csv('/databricks/driver/WDataFiles_Stage1/Cities.csv')

    AnalysisException: Path does not exist: dbfs:/databricks/driver/WDataFiles_Stage1/Cities.csv
Memphis Meng

2 Answers


spark.read... works with DBFS paths by default, so you have two choices:

  • use file:/databricks/driver/... to force reading from the local file system. This works on Community Edition because it's a single-node cluster; it won't work on a distributed cluster (a short sketch follows the code block below)

  • copy files to DBFS using the dbutils.fs.cp command (docs) and read from DBFS:

    dbutils.fs.cp("file:/databricks/driver/WDataFiles_Stage1/Cities.csv",
                  "/FileStore/Cities.csv")
    df = spark.read.csv("/FileStore/Cities.csv")
    ....
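
For the first option, a minimal sketch using the same file (the explicit file:/ scheme tells Spark to read from the driver's local disk instead of DBFS):

    # single-node (e.g. Community Edition) clusters only:
    # the file exists on the driver's local disk, not on DBFS
    df = spark.read.csv("file:/databricks/driver/WDataFiles_Stage1/Cities.csv")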
Alex Ott

I have really struggled with the Kaggle API, so I use opendatasets instead. I installed the library on the cluster.

    import opendatasets as od

    # downloads the competition zip into the given directory
    # (the /dbfs/ prefix is the FUSE mount, so this lands in DBFS)
    od.download(
        "https://www.kaggle.com/competitions/tlvmc-parkinsons-freezing-gait-prediction/data",
        "/dbfs/FileStore/mypath/")

The output, when you run this, first shows the zip being downloaded. Once the download is complete, the files are automatically extracted:

    Extracting archive /dbfs/FileStore/mypath/tlvmc-parkinsons-freezing-gait-prediction/tlvmc-parkinsons-freezing-gait-prediction.zip to /dbfs/FileStore/mypath/tlvmc-parkinsons-freezing-gait-prediction

documentation
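
Once extracted, the files sit under DBFS, so Spark can read them directly. A sketch, where the file name is a hypothetical placeholder (list the directory first to see what's actually there):

    # list the extracted files (path taken from the output above)
    display(dbutils.fs.ls("/FileStore/mypath/tlvmc-parkinsons-freezing-gait-prediction"))

    # "train.csv" is a placeholder; substitute one of the listed files
    df = spark.read.csv(
        "/FileStore/mypath/tlvmc-parkinsons-freezing-gait-prediction/train.csv",
        header=True)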

Climbs_lika_Spyder