
I have several data sources that I'm trying to work with; I asked a related question a couple of days ago (click here!).

I have 3 dataframes, each with a 'user_id' column that is common across all three, but the dataframes are not all exactly the same size.

I didn't realize that at first and used pd.concat to combine them, but the rows aren't lined up by user_id, and I'm not sure how to accomplish that.

Here is some sample data from each, along with the result of the concat (perhaps that is helpful?):

df1:

        user_id duration
    0   1000    116.830000
    1   1001    328.092000
    2   1002    259.043333
    3   1003    1041.000000
    4   1004    327.368750
    5   1005    470.220000
    6   1006    32.055000
    7   1007    496.830000
    8   1008    491.103333
    9   1009    698.710000

df2:

        user_id mb_used
    0   1000    1902.000000
    1   1001    16088.200000
    2   1002    13432.000000
    3   1003    27045.000000
    4   1004    19544.500000
    5   1005    17141.000000
    6   1006    17094.000000
    7   1007    28770.800000
    8   1008    18491.333333
    9   1009    23405.125000

df3:

        user_id id
    0   1000    11.000000
    1   1001    41.400000
    2   1002    29.333333
    3   1003    50.000000
    4   1004    22.125000
    5   1005    11.000000
    6   1006    77.000000
    7   1007    51.000000
    8   1008    28.000000
    9   1011    53.000000

df4 = pd.concat([df1, df2, df3], axis=1)

df4 result:

   user_id     duration  user_id       mb_used  user_id         id
0   1000.0   116.830000     1000   1902.000000   1000.0  11.000000
1   1001.0   328.092000     1001  16088.200000   1001.0  41.400000
2   1002.0   259.043333     1002  13432.000000   1002.0  29.333333
3   1003.0  1041.000000     1003  27045.000000   1003.0  50.000000
4   1004.0   327.368750     1004  19544.500000   1004.0  22.125000
5   1005.0   470.220000     1005  17141.000000   1005.0  11.000000
6   1006.0    32.055000     1006  17094.000000   1006.0  77.000000
7   1007.0   496.830000     1007  28770.800000   1007.0  51.000000
8   1008.0   491.103333     1008  18491.333333   1008.0  28.000000
9   1009.0   698.710000     1009  23405.125000   1011.0  53.000000

(note the last row: df1 and df2 contribute user_id 1009 while df3 contributes 1011, so the rows are not aligned by user_id)

Is there something I've done wrong, or something I could add to line them up by that shared user_id, or should I be using a different method? I'll be honest: I started with pd.merge but quickly realized I was in over my head trying to structure it, but if that is the only way (or the best way) I'll take another crack at it.

Thanks in advance for your time, and I apologize for what is likely a lack of proper terminology; I am quite new at Python (and programming in general).

seuadr
  • `from functools import reduce; dfs = [df1, df2, df3]; df_final = reduce(lambda left, right: pd.merge(left, right, on='user_id'), dfs)` – David Erickson Jan 06 '21 at 01:22 (a runnable sketch of this suggestion follows the comments)
  • is `user_id` unique in each of these dataframes? – tdelaney Jan 06 '21 at 01:31
  • @tdelaney the user_id is the same value across all three dataframes; they are separate usages for the same user. – seuadr Jan 06 '21 at 15:18
  • But that is not what I asked. Take `user_id mb_used`. Can the same user id be in there many times, e.g., mb_used on several different media? In that case, there isn't a one-to-one correspondence between rows in these dataframes. In this example for instance, perhaps rows need to be summed to make user_id unique in the dataframe. – tdelaney Jan 06 '21 at 16:15
  • @tdelaney I apologize, I misunderstood. No, user_ids do not appear more than once, but not all user_ids are present in all dataframes. This created some NaNs when I used the initially suggested merge. – seuadr Jan 06 '21 at 16:33
  • Thanks for the update. The NaNs are added whenever there is missing information. You can pass `how="inner"` to drop rows that would otherwise have NaN in them; otherwise that's just what is used with missing data. You can change them to a default with `df[df.isnull()] = 0`, and if you ended up with floats when you wanted ints (NaN forces int columns to float), then `df.astype(int)`. – tdelaney Jan 06 '21 at 16:40
  • @tdelaney thanks for the additional info! This is all new to me and sometimes I struggle to understand what to do - my Excel-fu is strong, but my Python is weak! :D – seuadr Jan 06 '21 at 17:05
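
A runnable sketch of David Erickson's reduce suggestion from the comments above, using small hypothetical frames in place of the question's data (merge defaults to an inner join; pass how="outer" to keep unmatched user_ids, per the discussion above):

    from functools import reduce
    import pandas as pd

    # Hypothetical toy frames mirroring the question's mismatch:
    # user_id 1009 appears only in df1/df2, 1011 only in df3
    df1 = pd.DataFrame({"user_id": [1000, 1009], "duration": [116.83, 698.71]})
    df2 = pd.DataFrame({"user_id": [1000, 1009], "mb_used": [1902.0, 23405.125]})
    df3 = pd.DataFrame({"user_id": [1000, 1011], "id": [11.0, 53.0]})

    dfs = [df1, df2, df3]
    # Fold the list into one chained merge; add how="outer" inside pd.merge
    # if unmatched user_ids should be kept rather than dropped
    df_final = reduce(lambda left, right: pd.merge(left, right, on="user_id"), dfs)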

3 Answers


Use merge instead:

df4 = df1.merge(df2, on="user_id").merge(df3, on="user_id")
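
Note that merge defaults to an inner join, so only user_ids present in all three dataframes survive; with the sample data above, 1009 (absent from df3) and 1011 (absent from df1 and df2) would be dropped. A minimal sketch with toy frames standing in for the question's data:

    import pandas as pd

    # Toy frames reproducing the question's mismatch
    df1 = pd.DataFrame({"user_id": [1000, 1009], "duration": [116.83, 698.71]})
    df2 = pd.DataFrame({"user_id": [1000, 1009], "mb_used": [1902.0, 23405.125]})
    df3 = pd.DataFrame({"user_id": [1000, 1011], "id": [11.0, 53.0]})

    df4 = df1.merge(df2, on="user_id").merge(df3, on="user_id")
    print(df4)  # only user_id 1000 survives the default inner join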

Pablo C
    Worked perfectly, thanks! I know this was probably a really basic question so I appreciate your time. – seuadr Jan 06 '21 at 15:23

Set user_id as the index of each dataframe, then concat along the column axis. Concatenation will align on the row labels (that is, the index you set), with unmatched user_ids getting NaN.

df4 = pd.concat([d.set_index('user_id') for d in [df1, df2, df3]], axis=1)
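
Since set_index moves user_id into the row labels, you may want it back as a column afterwards. A small sketch of the full round trip, again with toy frames standing in for the question's data:

    import pandas as pd

    df1 = pd.DataFrame({"user_id": [1000, 1009], "duration": [116.83, 698.71]})
    df2 = pd.DataFrame({"user_id": [1000, 1009], "mb_used": [1902.0, 23405.125]})
    df3 = pd.DataFrame({"user_id": [1000, 1011], "id": [11.0, 53.0]})

    # Align on the user_id index; unmatched ids (1009, 1011) get NaN in the missing columns
    df4 = pd.concat([d.set_index("user_id") for d in [df1, df2, df3]], axis=1)
    df4 = df4.reset_index()  # restore user_id as an ordinary column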
tdelaney

If you want to keep the user_ids from all 3 dataframes, this can be accomplished using merge with an outer join as follows:

df4 = pd.merge(df1, df2, how="outer", on="user_id").merge(df3, how="outer", on="user_id")
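
As the comments above note, the outer join introduces NaNs wherever a user_id is missing from one of the frames. A hedged sketch of filling those with a default value, using the same toy frames as before:

    import pandas as pd

    df1 = pd.DataFrame({"user_id": [1000, 1009], "duration": [116.83, 698.71]})
    df2 = pd.DataFrame({"user_id": [1000, 1009], "mb_used": [1902.0, 23405.125]})
    df3 = pd.DataFrame({"user_id": [1000, 1011], "id": [11.0, 53.0]})

    df4 = pd.merge(df1, df2, how="outer", on="user_id").merge(df3, how="outer", on="user_id")
    df4 = df4.fillna(0)  # replace NaNs from missing user_ids with a default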
ZealousWeb
  • This worked great, thanks! I know this is a pretty basic question, so I am thankful for your time and effort. – seuadr Jan 06 '21 at 15:23
  • Just for anyone else that might be looking at this: ultimately, I used ZealousWeb's solution because it contains an outer join, and I discovered that not all user_ids appeared in all dataframes, creating some NaNs. Including how="outer" kept those records from being dropped, so it was, in my opinion, the best solution. – seuadr Jan 06 '21 at 16:35