
I am trying to create a function that takes in CSV files, creates a dataframe from each, and concatenates/sums them like so:

                     id  number_of_visits
0   3902932804358904910                 2
1   5972629290368575970                 1
2   5345473950081783242                 1
3   4289865755939302179                 1
4  36619425050724793929                19

+ 

                     id  number_of_visits
0   3902932804358904910                 5
1   5972629290368575970                10
2   5345473950081783242                 3
3   4289865755939302179                20
4  36619425050724793929                13

=

                     id  number_of_visits
0   3902932804358904910                 7
1   5972629290368575970                11
2   5345473950081783242                 4
3   4289865755939302179                21
4  36619425050724793929                32

My main issue is that in the for loop, after creating the dataframes, I tried to concatenate with df += new_df, but new_df wasn't being added. So I tried the following implementation:

import pandas as pd

def add_dfs(files):
    master = []
    big = pd.DataFrame({'id': 0, 'number_of_visits': 0}, index=[0])  # dummy df to initialize
    for k in range(len(files)):
        new_df = create_df(str(files[k]))  # helper method to read, create and clean dfs
        master.append(new_df)  # builds a list of dataframes in master
    for k in range(len(master)):
        # iterate through the list of dfs and add them together
        big = pd.concat([big, master[k]]).groupby(['id', 'number_of_visits']).sum().reset_index()
    return big

Which gives me the following

                    id  number_of_visits
1  1000036822946495682                 2
2  1000036822946495682                 4
3  1000044447054156512                 1
4  1000044447054156512                 9
5  1000131582129684623                 1

So the number_of_visits for each user_id aren't actually adding together, they're just being sorted in order by number_of_visits.
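
A minimal sketch (with made-up ids standing in for the real data) reproduces the behaviour: grouping on both columns turns every distinct (id, number_of_visits) pair into its own group, so sum() has nothing left to add up.

import pandas as pd

a = pd.DataFrame({'id': [100, 200], 'number_of_visits': [5, 1]})
b = pd.DataFrame({'id': [100, 200], 'number_of_visits': [10, 2]})

# Both columns are grouping keys, so no value column remains to aggregate;
# the "sum" is a no-op and duplicate ids survive:
print(pd.concat([a, b]).groupby(['id', 'number_of_visits']).sum().reset_index())
#     id  number_of_visits
# 0  100                 5
# 1  100                10
# 2  200                 1
# 3  200                 2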

Wiseface
  • "So the number_of_visits for each user_id aren't actually adding together, they're just being sorted in order by number_of_visits" Okay, so what do you *want* to happen instead, and how is that different? – Karl Knechtel Mar 25 '21 at 04:19
  • @KarlKnechtel I showed what I'm looking for in the beginning of the post, but to recap: I want any id with more than one entry to only show once with the corresponding number of visits to be added. For example: If dataframe A has an id of 100 with 5 visits and dataframe B also has an id of 100 with 10 visits, when I add A and B together, I want to see id 100 with 15 visits. What I would get with my function is both entries id 100 showing up sequentially – Wiseface Mar 25 '21 at 04:27
  • Does https://stackoverflow.com/questions/58218387/pandas-merge-rows-with-same-id help? – Karl Knechtel Mar 25 '21 at 04:29
  • It's not helpful because I'm not just looking to drop duplicates; I need the values in number_of_visits to be added for any duplicate, while also dropping the duplicates – Wiseface Mar 25 '21 at 04:36
  • If you can *detect* duplicates and *group* them - which is part of the task of removing them - then you can further process the groups, for example by summing values from within them. – Karl Knechtel Mar 25 '21 at 07:39

2 Answers


Pass your list of dataframes directly to concat(), then group on id and sum.

>>> pd.concat(master).groupby('id').number_of_visits.sum().reset_index()
                     id  number_of_visits
0  36619425050724793929                32
1   3902932804358904910                 7
2   4289865755939302179                21
3   5345473950081783242                 4
4   5972629290368575970                11

Your function then becomes:

def add_dfs(files):
    master = []
    for f in files:
        new_df = create_df(f)  # helper method to read, create and clean dfs
        master.append(new_df)
    big = pd.concat(master).groupby('id').number_of_visits.sum().reset_index()
    return big
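
As a self-contained check, here is a minimal sketch where inline frames stand in for create_df's output (which isn't shown in the question):

import pandas as pd

master = [
    pd.DataFrame({'id': [100, 200], 'number_of_visits': [5, 1]}),
    pd.DataFrame({'id': [100, 200], 'number_of_visits': [10, 2]}),
]

# One concat over the whole list, then a single groupby-sum on id:
big = pd.concat(master).groupby('id').number_of_visits.sum().reset_index()
print(big)
#     id  number_of_visits
# 0  100                15
# 1  200                 3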

You can use

df1['number_of_visits'] += df2['number_of_visits']

This gives you:

|    |                   id |   number_of_visits |
|---:|---------------------:|-------------------:|
|  0 |  3902932804358904910 |                  7 |
|  1 |  5972629290368575970 |                 11 |
|  2 |  5345473950081783242 |                  4 |
|  3 |  4289865755939302179 |                 21 |
|  4 | 36619425050724793929 |                 32 |
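
Note that += adds by row index position, not by id, so this only works when both frames list the same ids in the same order. If the order might differ, one alignment-safe sketch uses Series.add with fill_value:

import pandas as pd

df1 = pd.DataFrame({'id': [100, 200], 'number_of_visits': [5, 1]})
df2 = pd.DataFrame({'id': [200, 100], 'number_of_visits': [2, 10]})  # same ids, different order

# Index by id so the addition matches rows by id rather than by position;
# fill_value=0 keeps any id that appears in only one frame.
total = (df1.set_index('id')['number_of_visits']
            .add(df2.set_index('id')['number_of_visits'], fill_value=0)
            .reset_index())
print(total)  # id 100 -> 15, id 200 -> 3 (a plain += would have paired 5 with 2)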
Albo