I'm using Python to analyze COVID-19 data in Brazil. The federal government publishes a CSV file with a record of each vaccination in the country. This CSV file is over 170 GB.
For my study I need to query this CSV file to obtain the number of vaccinations grouped by city and by day. In SQL it would be something like:
SELECT city, day, COUNT(*)
FROM my_table
GROUP BY city, day
How can I extract this information from the online CSV file, given that it is so large?
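One way to get those counts without ever holding the whole file in memory is to stream it in chunks with pandas and fold each chunk's group counts into a running total. This is only a sketch: the `city`/`day` column names and the `;` separator are assumptions (the real file uses Portuguese column headers), so adjust `usecols` and `sep` to match the actual file.

```python
from collections import Counter

import pandas as pd


def count_by_city_day(csv_path, chunksize=1_000_000):
    """Stream the CSV in chunks and accumulate (city, day) counts.

    NOTE: the column names "city" and "day" and the ";" separator are
    placeholders; the real vaccination file uses Portuguese headers.
    """
    counts = Counter()
    for chunk in pd.read_csv(csv_path, sep=";",
                             usecols=["city", "day"],
                             chunksize=chunksize):
        # Count within this chunk, then fold into the running totals.
        for key, n in chunk.groupby(["city", "day"]).size().items():
            counts[key] += n
    return counts
```

Because only one chunk is in memory at a time, this scales to a file of any size; the result is a dict-like mapping from `(city, day)` tuples to counts.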
This file is updated daily, since new people get vaccinated every day, meaning new rows are appended to the file each day. I'd like to extract/update the counters daily. Is there a smart/quick way to detect the new rows in the CSV file and update the counters, without downloading the whole file every day and importing it into a database?
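If the server hosting the file honors HTTP Range requests and rows really are only appended (never rewritten), one option is to remember the byte offset you processed last and fetch only the tail. A minimal sketch, assuming those two conditions hold; `fetch_new_rows` and the offset bookkeeping are hypothetical names, and you should verify the endpoint actually returns `206 Partial Content` for ranged requests before relying on this.

```python
import urllib.request


def fetch_new_rows(url, last_offset):
    """Return (new_bytes, new_offset), downloading only bytes past last_offset.

    ASSUMES the server supports HTTP Range requests and the file is
    append-only; if the file is regenerated instead, this would miss changes.
    """
    # HEAD request to learn the current file size without downloading it.
    head = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(head) as resp:
        total = int(resp.headers["Content-Length"])
    if total <= last_offset:
        return b"", last_offset  # nothing appended since last run
    # Ranged GET: fetch only the appended tail of the file.
    req = urllib.request.Request(
        url, headers={"Range": f"bytes={last_offset}-"})
    with urllib.request.urlopen(req) as resp:
        return resp.read(), total
```

You would then parse the returned bytes as CSV rows (taking care that the tail may start mid-row if the last fetch ended mid-row) and add them to the saved counters, persisting `new_offset` for the next day's run.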
The data is available here: https://qsprod.saude.gov.br/extensions/covid-19_html/covid-19_html.html under the link Dados Completos (a 153468857093-byte CSV hosted on S3).
A sample of the input file is available here: https://drive.google.com/file/d/1LRVJMKeE0wzuGshmfsI7pnfpHA800iph/view?usp=sharing