
I'm using Python to analyse COVID-19 data in Brazil. The federal government publishes a CSV file with a record of each vaccination in the country. This CSV file is over 170 GB.

For my study I need to query this CSV file to obtain the number of vaccinations grouped by city and by day. In SQL it would be something like:

SELECT city, day, COUNT(*)
FROM my_table
GROUP BY city, day

How can I extract this information from the online CSV file, given that it is so large?

This file is updated daily as new people get vaccinated, which means new rows are appended to the file every day.

I'd like to extract/update the counters daily. Is there a smart/quick way to check for the new rows in the CSV file and update the counters?

I cannot download the whole file every day and import it into a database.

The data is available here: https://qsprod.saude.gov.br/extensions/covid-19_html/covid-19_html.html via the link Dados Completos, which points to a 153468857093-byte CSV on S3.

The input file sample is available here: https://drive.google.com/file/d/1LRVJMKeE0wzuGshmfsI7pnfpHA800iph/view?usp=sharing

  • please upload a sample of the input file. – balderman Nov 16 '21 at 12:29
  • How about just upload the CSV into a SQL database? It would probably turn into less than 170GB with the right compression settings – OneCricketeer Nov 16 '21 at 12:42
  • @OneCricketeer – this is not a dupe of https://stackoverflow.com/questions/8424771/parallel-processing-of-a-large-csv-file-in-python since that's talking about an offline file. – AKX Nov 16 '21 at 12:42
  • @AKX The file will need to be downloaded (or streamed if server allows), regardless – OneCricketeer Nov 16 '21 at 12:43
  • Again, if you are comfortable with SQL, why don't you import the data to a database and use that? It's going to be a lot more efficient than using Python directly on the CSV file, and easily worth the effort if you have more than 2-3 queries. – tripleee Nov 16 '21 at 12:43
  • @OneCricketeer Indeed, if the server supports streaming, a group-by like this could be done with `for x in resp.iter_lines()` and a `Counter`. That's why this is different. – AKX Nov 16 '21 at 12:44
  • The sample in Google Drive does not represent your question. Please upload a better sample. – balderman Nov 16 '21 at 12:48
  • If my nonexistent Portuguese is correct, the data OP wants is the `Dados Completos` link from https://opendatasus.saude.gov.br/en/dataset/covid-19-vacinacao/resource/ef3bd0b8-b605-474b-9ae5-c97390c197a8 which points to a 153468857093 byte CSV on S3. Could be streamed, could be chunked :) Also, I imagine the other Dados files are per-region or something, so could also be parallelized over them. – AKX Nov 16 '21 at 12:48
  • @AKX yes.. but the question is about a different data structure. Anyway - I think *The Aelfinn*'s answer from https://stackoverflow.com/questions/35371043/use-python-requests-to-download-csv can work here very well. – balderman Nov 16 '21 at 12:53
  • @balderman Indeed. As I said in a comment above, `resp.iter_lines()` and a counter. – AKX Nov 16 '21 at 12:55

1 Answer


Something like the code below seems to do the trick (assuming you have enough memory and a stable Internet connection to process the file in a single request).

The data in the Counter ends up being e.g.

Counter({
  ('BRASILIA', '2021-03-18'): 2,
  ('SAO PAULO', '2021-03-26'): 1,
  ('INDAIATUBA', '2021-08-09'): 1,
  ...
})

You'll want to remove the islice() to process more than the first 100 rows, of course.

import csv
import pickle
from collections import Counter
from itertools import islice

import requests

DATA_URL = "https://s3-sa-east-1.amazonaws.com/ckan.saude.gov.br/PNI/vacina/completo/2021-11-15/part-00000-d217d29f-9db0-4280-ad94-ff0afe3d8b11-c000.csv"

# Stream the response instead of loading the whole 170 GB body into memory.
resp = requests.get(DATA_URL, stream=True)
resp.raise_for_status()
resp.encoding = "UTF-8"

counter = Counter()

# Parse the semicolon-delimited CSV line by line as it arrives;
# islice() limits processing to the first 100 rows for testing.
for row in islice(
    csv.DictReader(resp.iter_lines(decode_unicode=True), delimiter=";"), 100
):
    key = (row.get("estabelecimento_municipio_nome"), row.get("vacina_dataaplicacao"))
    counter[key] += 1

# Persist the counts so they can be reloaded later.
with open("data.pickle", "wb") as outf:
    pickle.dump(counter, outf)

For anything more complex, I would really recommend downloading the file and loading it into e.g. a PostgreSQL table.
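
As a rough illustration of that route, here is a minimal sketch that uses SQLite from the standard library so it stays self-contained; with PostgreSQL you would typically use COPY instead of executemany(). The local file name vacinacao_completo.csv and the database name are assumptions, and the sketch presumes the full CSV has already been downloaded.

import csv
import sqlite3
from itertools import islice

DB_PATH = "vaccinations.db"          # hypothetical database file
CSV_PATH = "vacinacao_completo.csv"  # hypothetical local copy of the full CSV

conn = sqlite3.connect(DB_PATH)
conn.execute("CREATE TABLE IF NOT EXISTS vaccinations (city TEXT, day TEXT)")

with open(CSV_PATH, encoding="utf-8", newline="") as f:
    reader = csv.DictReader(f, delimiter=";")
    while True:
        # Insert in batches to keep memory usage bounded.
        batch = [
            (row.get("estabelecimento_municipio_nome"), row.get("vacina_dataaplicacao"))
            for row in islice(reader, 100_000)
        ]
        if not batch:
            break
        conn.executemany("INSERT INTO vaccinations (city, day) VALUES (?, ?)", batch)
        conn.commit()

# The GROUP BY from the question now runs directly in SQL.
for city, day, count in conn.execute(
    "SELECT city, day, COUNT(*) FROM vaccinations GROUP BY city, day"
):
    print(city, day, count)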

AKX
  • Thanks for the response! Now let's say I have to update my counters daily, since new people get vaccinated every day. Is there a smart/quick way to check for the new rows in the CSV file? – Luiz Fernando Puttow Southier Nov 16 '21 at 17:52
  • Well, _if_ you are sure that new rows get added to the end of the file (which probably isn't the case, since the first 100 rows include entries for 2021-03 and 2021-08), you could keep track of the size of the old file, then only request the new bytes and so on (see the sketch below). But it doesn't seem like that's the case here. – AKX Nov 16 '21 at 18:48
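
For completeness, here is a minimal sketch of that "only request the new bytes" idea, assuming the server supports HTTP Range requests and that new rows really are only appended at the end of the file (which, as noted above, may not hold for this dataset). The state file name is a hypothetical choice.

import os

import requests

DATA_URL = "https://s3-sa-east-1.amazonaws.com/ckan.saude.gov.br/PNI/vacina/completo/2021-11-15/part-00000-d217d29f-9db0-4280-ad94-ff0afe3d8b11-c000.csv"
STATE_FILE = "last_size.txt"  # hypothetical file holding the previously seen size

# Size of the file the last time it was processed (0 on the first run).
last_size = 0
if os.path.exists(STATE_FILE):
    with open(STATE_FILE) as f:
        last_size = int(f.read().strip())

# Ask the server how large the file is now.
head = requests.head(DATA_URL)
head.raise_for_status()
current_size = int(head.headers["Content-Length"])

if current_size > last_size:
    # Request only the bytes appended since the last run.
    resp = requests.get(
        DATA_URL, headers={"Range": f"bytes={last_size}-"}, stream=True
    )
    resp.raise_for_status()
    resp.encoding = "UTF-8"
    for line in resp.iter_lines(decode_unicode=True):
        ...  # parse the new rows here and update the stored counters

with open(STATE_FILE, "w") as f:
    f.write(str(current_size))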