
I would like to know the most efficient way to test whether a large file exists locally (without loading it into memory). If it doesn't exist (or isn't readable), then download it. The goal is to load the data into a pandas DataFrame.

I wrote the snippet below, which works (tested with a small file). What about correctness and Pythonic style?

import os
import pandas as pd

url = "http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv"  # 4.7 kB
file = "./test_file.csv"

try:
    os.open(file, os.O_RDONLY)
    df_data = pd.read_csv(file, index_col=0)

except:
    df_data = pd.read_csv(url, index_col=0)
    df_data.to_csv(file)
alEx
  • You can pass `nrows=1` and then check the df.shape or length, so this will just read a single row – EdChum May 15 '17 at 08:34
    To check whether a file exists, see http://stackoverflow.com/questions/82831/how-do-i-check-whether-a-file-exists-using-python — put an os.path.isfile check before downloading and reading into a df, and in your except handle errors more specific to the file containing invalid characters that cause problems when loading into a df. – Satyadev May 15 '17 at 08:36
    `import os.path` then `os.path.isfile(fname)` will return True if the file exists – Nuageux May 15 '17 at 08:37
  • `os.path.isfile(file)` seems to be the best solution for checking before downloading a huge file: `if not os.path.isfile(file):` – alEx May 15 '17 at 15:36
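
Pulling the comments together, here is a minimal sketch (the helper name is my own) that checks both existence and readability with `os.path.isfile` and `os.access` before touching the file's contents, falling back to the URL and caching the download:

```python
import os

import pandas as pd

def load_csv_cached(path, url):
    """Read a CSV from disk if it exists and is readable;
    otherwise download it with pandas and cache it locally.
    isfile/access only stat the file, so nothing is loaded
    into memory for the check."""
    if os.path.isfile(path) and os.access(path, os.R_OK):
        return pd.read_csv(path, index_col=0)
    df = pd.read_csv(url, index_col=0)
    df.to_csv(path)
    return df
```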

3 Answers


I think you can use try/except and catch FileNotFoundError:

import pandas as pd

url = "http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv"  # 4.7 kB
file = "./test_file.csv"

try:
    df_data = pd.read_csv(file, index_col=0)

except FileNotFoundError: 
    df_data = pd.read_csv(url, index_col=0)
    df_data.to_csv(file)
jezrael
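
A variant of this answer (my own sketch, wrapped in a hypothetical helper) that also covers the "not readable" case from the question by catching PermissionError alongside FileNotFoundError:

```python
import pandas as pd

def read_csv_or_fetch(path, url):
    # Fall back to the URL if the local file is missing
    # (FileNotFoundError) or unreadable (PermissionError),
    # then cache the downloaded data to disk.
    try:
        return pd.read_csv(path, index_col=0)
    except (FileNotFoundError, PermissionError):
        df = pd.read_csv(url, index_col=0)
        df.to_csv(path)
        return df
```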

You can check whether the file exists, and load it from the URL if it does not:

import os
import pandas as pd

url = "http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv"
f = "./test.csv"

if os.path.exists(f):
    df = pd.read_csv(f)
else:
    df = pd.read_csv(url)
Robbie

`os.path.isfile(file)` seems to me the best solution: check before downloading a huge file:

import os
import urllib  # Python 2: urlretrieve is top-level; Python 3 moved it to urllib.request

import pandas as pd

if not os.path.isfile(file):
    urllib.urlretrieve(url, file)
df_data = pd.read_csv(file, index_col=0)

It's slower than loading it directly into memory from the URL (download to disk, then load into memory), but safer in my situation...
Thanks to all
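
For Python 3, `urlretrieve` moved to `urllib.request`; a sketch of the same download-then-read pattern, wrapped in a hypothetical helper:

```python
import os
import urllib.request

import pandas as pd

def fetch_then_read(url, path):
    """Download the file to disk only if it is missing
    (Python 3: urlretrieve lives in urllib.request),
    then read it from disk."""
    if not os.path.isfile(path):
        urllib.request.urlretrieve(url, path)
    return pd.read_csv(path, index_col=0)
```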

alEx