I am trying to create a function that will take in CSV files and create dataframes and concatenate/sum like so:
id number_of_visits
0 3902932804358904910 2
1 5972629290368575970 1
2 5345473950081783242 1
3 4289865755939302179 1
4 36619425050724793929 19
+
id number_of_visits
0 3902932804358904910 5
1 5972629290368575970 10
2 5345473950081783242 3
3 4289865755939302179 20
4 36619425050724793929 13
=
id number_of_visits
0 3902932804358904910 7
1 5972629290368575970 11
2 5345473950081783242 4
3 4289865755939302179 21
4 36619425050724793929 32
My main issue is that in the for loop after I create the dataframes, I tried to concatenate by df += new_df
and new_df
wasn't being added. So I tried the following implementation.
def add_dfs(files):
master = []
big = pd.DataFrame({'id': 0, 'number_of_visits': 0}, index=[0]) # dummy df to initialize
for k in range(len(files)):
new_df = create_df(str(files[k])) # helper method to read, create and clean dfs
master.append(new_df) #creates a list of dataframes with in master
for k in range(len(master)):
big = pd.concat([big, master[k]]).groupby(['id', 'number_of_visits']).sum().reset_index()
# iterate through list of dfs and add them together
return big
Which gives me the following
id number_of_visits
1 1000036822946495682 2
2 1000036822946495682 4
3 1000044447054156512 1
4 1000044447054156512 9
5 1000131582129684623 1
So the number_of_visits
for each user_id
aren't actually adding together, they're just being sorted in order by number_of_visits