
I have multiple pandas dataframes (5) that share some common column names but differ in size. I need to sum at least 5 of the common columns from each dataframe (25 sums in total) and then compare the sums.

Data:
df_files = [df1, df2, df3, df4, df5]
df_files
out:
    [         z          name  ...             a         b
0            10           DAD  ...             4        4
1            10           DAD  ...             5        4
2            10           DAD  ...             3        6
3            10           DAD  ...             9        2
4            10           DAD  ...            11        1
...         ...           ...  ...            ...       ...
7495       <NA>           NaN  ...             2        0
7496       <NA>           NaN  ...             5        3
7497       <NA>           NaN  ...             3        1
7498       <NA>           NaN  ...             2        0
7499       <NA>           NaN  ...             4        3

[7500 rows x 35 columns] # The dataframes all look like this, but some vary in size.

What I need is to sum some specific common columns and then compare those sums to see if they match. If they all match, print an OK. If they do not, report which value disagrees, e.g. 'The value of "column name" from df3 and df4 does not match the other values', together with the expected common value (the value the majority agrees on; the columns that do match do not need to be listed, just the common expected value). Some of the sums may not match each other at all; in that case the expected common value should be taken as the value that is repeated the most, and if no value is repeated, print that the values need correction before proceeding and show the values that do not match.

I was beginning with something like this:

   Example:
     df = pd.DataFrame([[df1["a"].sum(), df2["a"].sum(), df3["a"].sum(),
                         df4["a"].sum(), df5["a"].sum()]], columns=["a"] * 5)
     df
     out:
         a       a      a      a      a  
     0  425     425    426    427    425

or maybe they can be compared as a list of integers.
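
For example, a rough sketch of that list idea could look like this (column "a" stands in for any of the common columns; df_files is the list from above):

sums = [df["a"].sum() for df in df_files]

if len(set(sums)) == 1:
    print("OK")                    # all five sums agree
else:
    print("Sums differ:", sums)    # at least one dataframe is off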

I would appreciate your attention to this question. I hope I have been specific enough.

Richard21
  • `[file['A'].sum().values for file in df_files]`? – It_is_Chris Oct 15 '20 at 20:08
  • Please *show* and not *tell* us your current data and desired results. Human language tends to be imprecise. See [How to make good reproducible pandas examples](https://stackoverflow.com/q/20109391/1422451). – Parfait Oct 15 '20 at 20:11
  • @It_is_Chris Thanks for your answer. I guess that will be like: `[file['A'].values.sum() for file in df_files]`. But then how can I accomplish the comparisons? – Richard21 Oct 15 '20 at 20:24

1 Answer

If I understand your problem correctly, you need to bring the names of the dataframes and their respective column sums into one place to compare them. In that case I usually use a dictionary to keep track of the variable names, something like this:

df_files = {'df1': df1, 'df2': df2, 'df3': df3, 'df4': df4, 'df5': df5}
summary = pd.DataFrame()
for name, df in df_files.items():
    cols = list(summary)
    # append this dataframe's column sums as a new column named after the dataframe
    summary = pd.concat([summary, df.sum()], axis=1)
    summary.columns = cols + [name]
# keep only the rows (original column names) that are present in every dataframe
summary = summary.dropna()

The summary will be a data frame with the common column names as index and the data frame names as columns. If you only have 5 dfs with 5 common column names, it will be easy to inspect the results by eye. Here is a sample result I ran for 3 dfs:

    df1   df2   df3
a   6.0  10.0   6.0
b  15.0  14.0  15.0

But if the numbers grow, you can use the mode of each row to find the most frequent result and compare every value against it (or divide all values by the mode and look for non-1 results).
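
For example, a rough sketch of that idea built on the summary frame above (the exact messages are just placeholders you can adapt):

for col_name, row in summary.iterrows():
    counts = row.value_counts()      # how often each sum appears across the dfs
    expected = counts.index[0]       # the most frequent sum, i.e. the mode of the row
    if counts.iloc[0] == len(row):
        print(f"{col_name}: OK (common value {expected})")
    elif counts.iloc[0] == 1:
        # no sum is repeated, so there is no majority to trust
        print(f"{col_name}: none of the values match, correction needed: {row.to_dict()}")
    else:
        bad = row[row != expected]
        print(f"{col_name}: {list(bad.index)} do not match the others (expected {expected})")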

Yashar
  • Thanks for answering @Yashar. It works for me. Now I only need to compare the columns to see the differences. Thanks a lot. – Richard21 Oct 15 '20 at 23:49