I have a big data set of 2500000 rows with the following format:
- Merkmal == Feature
- Auspraegung_Code == Code for the following column
- Auspraegung_Text == The actual value (category) of the feature
- Anzahl == Count for this value of the feature
The rest is not of interest or is self-explanatory.
My issue is that I'd like to reshape this DataFrame so that the Auspraegung_Text entries become columns, with one row per Gitter_ID holding the corresponding counts (the Anzahl column).
Currently what I do is this:
import pandas as pd

df_result = pd.DataFrame()
for i, ids in enumerate(Gitter_ids):
    # Select the texts and counts that belong to the current grid cell
    auspraegungen = df["Auspraegung_Text"][df["Gitter_ID_100m_neu"] == ids]
    auspraegung_amounts = df["Anzahl"][df["Gitter_ID_100m_neu"] == ids]
    df_result.loc[i, "Cell_id"] = ids
    # Write each count into the column named after its Auspraegung_Text
    for auspraegung, amounts in zip(auspraegungen, auspraegung_amounts):
        df_result.loc[i, auspraegung] = amounts
Result DataFrame() should look like this:
The code above works, but it is very slow. How can I optimize the process?
The data used in this problem is census data from Germany.
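
The reshaping described above is a long-to-wide pivot, so one possible direction is pandas' built-in `pivot_table` instead of the row-by-row loop. This is only a minimal sketch on a made-up toy frame, not a tested solution for the real census table: the cell IDs, texts, and counts below are invented, and `aggfunc="sum"` is an assumption (it adds up Anzahl when the same cell/text pair occurs more than once, whereas the loop would keep the last value).

```python
import pandas as pd

# Toy stand-in for the census table (all values are made up for illustration).
df = pd.DataFrame({
    "Gitter_ID_100m_neu": ["A", "A", "B", "B", "B"],
    "Auspraegung_Text": ["male", "female", "male", "female", "under 18"],
    "Anzahl": [3, 4, 5, 6, 2],
})

# One row per grid cell, one column per Auspraegung_Text, Anzahl as the values.
# aggfunc="sum" is an assumption: it sums Anzahl if a (cell, text) pair repeats.
df_result = (
    df.pivot_table(
        index="Gitter_ID_100m_neu",
        columns="Auspraegung_Text",
        values="Anzahl",
        aggfunc="sum",
    )
    .rename_axis(columns=None)   # drop the "Auspraegung_Text" label on the column axis
    .reset_index()               # turn the grid-cell index back into a regular column
    .rename(columns={"Gitter_ID_100m_neu": "Cell_id"})
)

print(df_result)
```

Combinations that do not occur in the data come out as NaN, which matches what the loop leaves behind for cells it never assigns. If the same Auspraegung_Text can appear under different Merkmal values, the column names would probably need to include Merkmal as well to keep them apart.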