
TL;DR: I have a lot of data that needs to be treated in a consistent way, and I'm looking for ways to automate that.

Hi everyone, I'm dealing with datasets on forest fires in Portugal from 2013 to 2022. The main objective is to provide the general public with official information on the number of fires, burnt area, and other indicators, so that this issue can be discussed based on facts rather than speculation.

The problem is that I'm struggling to create a workflow that doesn't have me writing the same code for every dataframe.

I would like to know if there is a way to turn this:

total_records_2022 = df_in_2022['id'].nunique()
total_records_2021 = df_in_2021['id'].nunique()
total_records_2020 = df_in_2020['id'].nunique()
total_records_2019 = df_in_2019['id'].nunique()
total_records_2018 = df_in_2018['id'].nunique()
total_records_2017 = df_in_2017['id'].nunique()
total_records_2016 = df_in_2016['id'].nunique()
total_records_2015 = df_in_2015['id'].nunique()
total_records_2014 = df_in_2014['id'].nunique()
total_records_2013 = df_in_2013['id'].nunique()

or this:

# GET TOTAL BURNT AREA FOR EACH YEAR

total_burn_area_measure = " ha"

df_in_2022_reset_burntarea = df_in_2022['icnf.burnArea.total'].fillna(0)
total_burnt_area_2022_number_full = df_in_2022_reset_burntarea.sum()
total_burnt_area_2022_number = "{:.2f}".format(total_burnt_area_2022_number_full)
total_burnt_area_2022 = total_burnt_area_2022_number + total_burn_area_measure

into a loop that runs over all the dataframes and applies whatever data treatment I need. The full code can be found here and, as you can see, there is a lot of data that needs to be treated in a consistent way.
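
Ideally I'm imagining something along these lines, where all the dataframes live in a dictionary keyed by year and a single loop applies whatever treatment is needed (just a rough sketch of what I'm hoping for; the dictionary and variable names are made up):

# Rough sketch: keep the dataframes in a dict keyed by year
dataframes = {2013: df_in_2013, 2014: df_in_2014}  # ... and so on up to 2022

total_records = {}
total_burnt_area = {}

for year, df in dataframes.items():
    total_records[year] = df['id'].nunique()            # unique ids per year
    burnt = df['icnf.burnArea.total'].fillna(0).sum()   # total burnt area per year
    total_burnt_area[year] = "{:.2f} ha".format(burnt)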

Any help, or guidance, you could give is much appreciated.

Full disclosure: This code will be used for non-commercial purposes.

Jorge Gomes
    Why are you reading your data into different variables without any structure? You could have `csv_files = {2022: 'file_to_2022_data.csv', ...}` and `total_records = { year: pd.read_csv(filepath) for year, filepath in csv_files.items()}` or something like that. Then looping is obvious? – Quang Hoang Aug 25 '22 at 17:09
  • @QuangHoang I'm doing it the best way I know, and that is why I wrote a question asking for help/guidance. I'm fairly new to coding and don't grasp, yet, many of the underlying concepts. – Jorge Gomes Aug 25 '22 at 17:12
    @QuangHoang In its current form, it would be closed on Code Review. – Mast Aug 25 '22 at 17:24
  • @Mast it would be closed here as well :-). – Quang Hoang Aug 25 '22 at 17:25
  • @JorgeGomes, this might help you: https://stackoverflow.com/q/6181935/14627505. Try using a dictionary where you store all your data (keys would be years). Also, here is [How to Iterate Through a Dictionary](https://realpython.com/iterate-through-dictionary-python/) – Vladimir Fokow Aug 25 '22 at 17:25
  • @QuangHoang That's a Stack Overflow problem. Don't redirect crap. – Mast Aug 25 '22 at 18:23

1 Answer


So, after a lot of browsing, I came up with this solution. It might not be the best way, I don't know, but it works.

# import libraries 
import json
import requests
import pandas as pd 
import datetime as dt 
from datetime import datetime, timedelta, date 

# Define year lists

csv_git = [2013,2014,2015,2016,2017,2018]   # years with raw ICNF CSVs on GitHub
csv_fogos = [2019,2020,2021]                # years with local fogos CSVs

years = [2013,2014,2015,2016,2017,2018,2019,2020,2021,2022]


# -------------------------------------
#     GET INITIAL DATA - ALL FOR 2022
# -------------------------------------

url_bar_2022 = "https://api.fogos.pt/v2/incidents/search?after=2022-01-01&limit=1000000"
# Get response from URL 
response_2022 = requests.get(url_bar_2022)
json_2022 = response_2022.json()
# Create dataframe for 2022 and treat the data 
df_in_2022 = pd.json_normalize(json_2022,'data')
df_in_2022.loc[:,'date'] = pd.to_datetime(df_in_2022['date'],format='%d-%m-%Y')
df_in_2022['month'] = pd.DatetimeIndex(df_in_2022['date']).month
df_in_2022 = df_in_2022.sort_values(by='district', ascending=True)

#df_in_ = {}

# Read the raw ICNF CSVs from GitHub into df_in_<year> variables
for i in csv_git:
    globals()[f"df_in_{i}"] = pd.read_csv(f'https://raw.githubusercontent.com/vostpt/ICNF_DATA/main/icnf_{i}_raw.csv')

# Read the local fogos CSVs into df_in_<year> variables
for i in csv_fogos:
    globals()[f"df_in_{i}"] = pd.read_csv(f'assets/fogos_{i}.csv')


# Rename the ICNF columns to match the 2022 API dataframe
for i in csv_git:
    globals()[f"df_in_{i}"] = globals()[f"df_in_{i}"].rename(columns={"Unnamed: 0": "id","MES":"month","DISTRITO":"district","ANO":"year","CONCELHO":"concelho","AREATOTAL":"icnf.burnArea.total"})



# Count the unique incident ids for every year
for i in years:
    globals()[f"total_records_{i}"] = globals()[f"df_in_{i}"]['id'].nunique()


print(total_records_2022)
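
A more dictionary-based version of the loading, along the lines of what the comments suggest, would probably look something like this (an untested sketch, reusing the same csv_git, csv_fogos and df_in_2022 from above):

# Untested sketch: keep every dataframe in one dict keyed by year instead of using globals()
df_in = {}

for i in csv_git:
    df_in[i] = pd.read_csv(f'https://raw.githubusercontent.com/vostpt/ICNF_DATA/main/icnf_{i}_raw.csv')
    df_in[i] = df_in[i].rename(columns={"Unnamed: 0": "id","MES":"month","DISTRITO":"district","ANO":"year","CONCELHO":"concelho","AREATOTAL":"icnf.burnArea.total"})

for i in csv_fogos:
    df_in[i] = pd.read_csv(f'assets/fogos_{i}.csv')

df_in[2022] = df_in_2022  # the dataframe built from the API above

# One dict comprehension replaces the ten total_records_<year> variables
total_records = {year: df['id'].nunique() for year, df in df_in.items()}
print(total_records[2022])

That way, adding a new CSV year should only mean adding it to one of the lists at the top.
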
Jorge Gomes