Combine data from lines in a pandas df where client_id is duplicated

Question

I have combined 2 dataframes and I want to make sure that the user id is not repeated and all values are written in 1 line, and not duplicated. This is a ready-made dataframe, but I need to edit it and remove repetitions. I attach the input data:

rzd = pd.DataFrame({
    'client_id': [111, 112, 113, 114, 115],
    'rzd_revenue': [1093, 2810, 10283, 5774, 981]})
rzd

auto = pd.DataFrame({
    'client_id': [113, 114, 115, 116, 117],
    'auto_revenue': [57483, 83, 912, 4834, 98]})
auto

air = pd.DataFrame({
    'client_id': [115, 116, 117, 118],
    'air_revenue': [81, 4, 13, 173]})
air

client_base = pd.DataFrame({
    'client_id': [111, 112, 113, 114, 115, 116, 117, 118],
    'address': ['Комсомольская 4', 'Энтузиастов 8а', 'Левобережная 1а', 'Мира 14', 'ЗЖБИиДК 1', 
                'Строителей 18', 'Панфиловская 33', 'Мастеркова 4']})
client_base

frames = [rzd, auto, air]
result = pd.concat(frames)
result

full_result = result.merge(client_base)
full_result

    client_id  rzd_revenue  auto_revenue  air_revenue          address
0         111       1093.0           NaN          NaN  Комсомольская 4
1         112       2810.0           NaN          NaN   Энтузиастов 8а
2         113      10283.0           NaN          NaN  Левобережная 1а
3         113          NaN       57483.0          NaN  Левобережная 1а
4         114       5774.0           NaN          NaN          Мира 14
5         114          NaN          83.0          NaN          Мира 14
6         115        981.0           NaN          NaN        ЗЖБИиДК 1
7         115          NaN         912.0          NaN        ЗЖБИиДК 1
8         115          NaN           NaN         81.0        ЗЖБИиДК 1
9         116          NaN        4834.0          NaN    Строителей 18
10        116          NaN           NaN          4.0    Строителей 18
11        117          NaN          98.0          NaN  Панфиловская 33
12        117          NaN           NaN         13.0  Панфиловская 33
13        118          NaN           NaN        173.0     Мастеркова 4

The numbers are different documentation, by type merge, concat, and so on, but I did not find clear articles on editing.

As a result, it should look like this:

   client_id  rzd_revenue  auto_revenue  air_revenue          address
0        111       1093.0           0.0          0.0  Комсомольская 4
1        112       2810.0           0.0          0.0   Энтузиастов 8а
2        113      10283.0       57483.0          0.0  Левобережная 1а
3        114       5774.0          83.0          0.0          Мира 14
4        115        981.0         912.0         81.0        ЗЖБИиДК 1

@mozway Let's take, for example, user 114 and see that first there is a line where he has a value in rzh_revenue, and then there is a line where there is a value in auto_revenue, and I want to make these values in one line. — Follin, Nov 04 '22 at 19:49
@Follin Do you want to put the `air_revenue` in the same line too? Please make a [mre] including desired output as well as the input data. For specifics, see [How to make good reproducible pandas examples](/q/20109391/4518341). You are using Pandas, right? BTW, for more tips, like how to write a good title, see [ask]. — wjandrea, Nov 04 '22 at 20:03
@wjandrea Yes, as soon as I could, I attached the table that should be in the end. I use pandas, — Follin, Nov 04 '22 at 20:15
@wjandrea, thank you for the corrections and tips on creating questions. I won't make such mistakes when creating questions anymore. :) — Follin, Nov 04 '22 at 20:36

score 0 · Answer 1 · answered Nov 04 '22 at 20:25

0

If possible, you should prefer joining so that there isn't a duplicate row issue at all. However, if you're just given full_result and not the initial tables, this can be accomplished using a groupby expression:

full_result.groupby("client_id").first().reset_index()

For each client_id, it takes the first non-null value of each column, then resets the index so client_id is just another column again.

answered Nov 04 '22 at 20:25

Eric Pauley

1,709
1
20
30

I got ```full_result``` by combining the first 3 give.(rzd, auto and air) – Follin Nov 04 '22 at 20:30

wjandrea · Answer 2 · 2022-11-05T18:45:43.490

You should be joining. You can use concat() to do that, just need to set the index first:

result = pd.concat([df.set_index('client_id') for df in frames], axis=1)
result

           rzd_revenue  auto_revenue  air_revenue
client_id                                        
111             1093.0           NaN          NaN
112             2810.0           NaN          NaN
113            10283.0       57483.0          NaN
114             5774.0          83.0          NaN
115              981.0         912.0         81.0
116                NaN        4834.0          4.0
117                NaN          98.0         13.0
118                NaN           NaN        173.0

Then to connect client_base, you can do something similar. I'll also replace the NaNs now.

full_result = result.join(client_base.set_index('client_id')).fillna(0)
full_result

           rzd_revenue  auto_revenue  air_revenue          address
client_id                                                         
111             1093.0           0.0          0.0  Комсомольская 4
112             2810.0           0.0          0.0   Энтузиастов 8а
113            10283.0       57483.0          0.0  Левобережная 1а
114             5774.0          83.0          0.0          Мира 14
115              981.0         912.0         81.0        ЗЖБИиДК 1
116                0.0        4834.0          4.0    Строителей 18
117                0.0          98.0         13.0  Панфиловская 33
118                0.0           0.0        173.0     Мастеркова 4

At this point you could .reset_index() if you wanted.

More info:

User guide: Merge, join, concatenate and compare
Pandas Merging 101

Many thanks for the alternative solution and for the attached sections for further study. Grateful. — Follin, Nov 06 '22 at 20:46

Combine data from lines in a pandas df where client_id is duplicated

2 Answers2