
I would like to write a Python script that downloads a CSV file from a URL and then returns it as JSON. The problem is that I need it to execute as fast as possible. What is the best way to do it? I was thinking about something like this:

import csv, io, json, requests

r_bytes = requests.get(URL).content
r = r_bytes.decode('utf8')
reader = csv.DictReader(io.StringIO(r))
json_data = json.dumps(list(reader))

What do you think? It doesn't look good to me, but I can't find a better way to solve this problem.

  • Can we have an example of the CSV you want to convert? – rmarquet21 Sep 13 '21 at 13:31
  • I suggest starting by profiling your code to find which part (retrieving data from the URL? parsing the CSV? encoding to JSON?) takes the most time, and focusing on improving it – Daweo Sep 13 '21 at 13:34
  • [How can you profile a Python script?](https://stackoverflow.com/questions/582336/how-can-you-profile-a-python-script) – martineau Sep 13 '21 at 13:35
  • @rmarquet21 I can't share an example right now, but it's a simple CSV with about 8 columns and about 1000 rows (could be more in the future). The problem is that this file will be updated in real time, and I want to download updates as often as possible – patryk_ostrowski Sep 13 '21 at 13:41
  • Consider Pandas, it has a fast and convenient way to convert CSVs into JSON. – SwingingChad Sep 13 '21 at 13:43
  • Thanks everyone. I'm going to try profiling this later :) – patryk_ostrowski Sep 13 '21 at 14:52

1 Answer


I tried comparing your conversion process with pandas and used this code:

import csv
import io
import json
import time

import pandas as pd
import requests

r_bytes = requests.get("https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv").content
print("finished download")
r = r_bytes.decode('utf8')
print("finished decode")

# Method 1: pandas DataFrame -> list of dicts -> JSON
start_df_timestamp = time.time()
df = pd.read_csv(io.StringIO(r))  # the sample file is comma-separated, so the default sep is correct
result_df = json.dumps(df.to_dict('records'))
end_df_timestamp = time.time()
print("The df method took {d_t}s".format(d_t=end_df_timestamp - start_df_timestamp))

# Method 2: csv.DictReader -> list of dicts -> JSON
start_csv_reader_timestamp = time.time()
reader = csv.DictReader(io.StringIO(r))
result_csv_reader = json.dumps(list(reader))
end_csv_reader_timestamp = time.time()
print("The csv-reader method took {d_t}s".format(d_t=end_csv_reader_timestamp - start_csv_reader_timestamp))

and the result was:

finished download
finished decode
The df method took 0.200181245803833s
The csv-reader method took 0.3164360523223877s

This was using a random 37k-row CSV file, and I noticed that downloading it was by far the most time-intensive step. Even though the pandas functions were faster for me, you should probably profile your code to see whether the conversion really adds significantly to your runtime. :-)
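
For reference, a minimal profiling sketch using the standard-library cProfile and pstats modules could look like this (fetch_and_convert is a hypothetical wrapper around the question's download-and-convert steps, run against the same sample file as above):

import cProfile
import csv
import io
import json
import pstats
import requests

URL = "https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv"

def fetch_and_convert(url):
    # Download, decode, and convert to JSON -- the same steps as in the question.
    text = requests.get(url).content.decode('utf8')
    return json.dumps(list(csv.DictReader(io.StringIO(text))))

# Profile the whole pipeline, then print the 10 slowest entries by cumulative time.
cProfile.run("fetch_and_convert(URL)", "fetch.prof")
pstats.Stats("fetch.prof").sort_stats("cumulative").print_stats(10)

The cumulative-time column should make it obvious whether the download or the conversion dominates.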

PS: If you need to constantly monitor the CSV and processing updates turns out to be time-intensive, you could use hashes and reprocess the file only when it has actually changed since your last download.
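
A minimal sketch of that idea with the standard-library hashlib (URL, process and the 60-second polling interval are placeholders, not something from the question):

import hashlib
import time

import requests

last_hash = None
while True:
    r_bytes = requests.get(URL).content                 # URL: the CSV endpoint you poll
    current_hash = hashlib.sha256(r_bytes).hexdigest()  # fingerprint of the raw bytes
    if current_hash != last_hash:                       # only reprocess when the bytes changed
        last_hash = current_hash
        process(r_bytes)                                # process: your decode/convert step (hypothetical)
    time.sleep(60)                                      # polling interval, tune to your needs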

  • Thanks, I'm going to try profiling my code later. I also forgot that using pandas here could be faster. Thank you also for the advice about using hashes :) – patryk_ostrowski Sep 13 '21 at 14:51