
I have the same dataset over different weeks (later weeks contain new rows). I want to append the new rows to the original dataframe to create one big dataframe with all unique rows and no duplicates. I can't just take the last week, because some rows get deleted over the weeks.

I tried the following code, but somehow my final_info dataframe still contains some non-unique values:

final_info = data[list(data.keys())[-1]]['all_info']
for week in reversed(data.keys()):
    df_diff = pd.concat([data[week]['all_info'], final_info]).drop_duplicates(subset='project_slug',
                                                                              keep=False)
    final_info = final_info.append(df_diff).reset_index(drop=True)

Does somebody see where it goes wrong?

Martijn
    Please provide a [mcve], as well as the current and expected output. – AMC Oct 28 '20 at 22:45
  • Does this answer your question? [Concat python dataframes based on unique rows](https://stackoverflow.com/questions/44236940/concat-python-dataframes-based-on-unique-rows) – AMC Oct 28 '20 at 22:46

2 Answers


If I understand your question, you are just trying to add the unique rows from one dataframe to another. I don't think there is any need to iterate through the keys the way you are doing. There is an example on a related question that I think can help you, and it is conceptually easier to follow. I'll walk through an example to make it clearer.

So if you have a dataframe A:

col1  col2
1     2
2     3
3     4

and a dataframe B:

col1  col2
1     2
2     3
6     4

These two dataframes share the same first two rows but differ in the last row. To get all the unique rows into one dataframe, you could first find the rows that appear in only one of them. For this example, you could get the unique row in dataframe B; let's call the result df_diff. The code to do this would be:

df_diff = B[~B.col1.isin(A.col1)]

output: col1  col2
        6     4

The line above builds what's called a boolean mask and then negates it with ~, so that you get all rows in dataframe B whose col1 value is not in dataframe A.

You could then concatenate this dataframe, df_diff, with the first dataframe A. We can call the result df_full. This step is done with:

df_full = pd.concat([A, df_diff], ignore_index=True)

The ignore_index=True just resets the index of the resulting dataframe. This will give you:

col1  col2
1     2
2     3
3     4
6     4

Now the resulting dataframe has the new row from dataframe B plus the original rows from dataframe A.
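Putting the steps above together, here is a minimal runnable sketch using the toy dataframes A and B from this answer:

```python
import pandas as pd

# The two example dataframes from above
A = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})
B = pd.DataFrame({'col1': [1, 2, 6], 'col2': [2, 3, 4]})

# Boolean mask: keep rows of B whose col1 value does not appear in A
df_diff = B[~B.col1.isin(A.col1)]

# Append the unique rows to A and reset the index
df_full = pd.concat([A, df_diff], ignore_index=True)
print(df_full)
#    col1  col2
# 0     1     2
# 1     2     3
# 2     3     4
# 3     6     4
```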

I think this would work for your situation, and it may take fewer lines of code.

J.T. Baker
  • With some small adaptations this answers my question perfectly! In my case I did have to go through the for loop in order to add the newest columns in a backwards fashion so that I did not lose any information (hard to explain the entire context in a comment haha). Thanks! – Martijn Oct 29 '20 at 11:32

There is also the method pandas.DataFrame.combine_first which does multiple things at the same time:

  • updates the first dataframe's rows that contain null values with the same rows from the second dataframe, if those do not contain null values
  • appends new rows to the first dataframe

Syntax: df_1.combine_first(df_2)

Reference: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.combine_first.html
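A small sketch with made-up data (combine_first aligns on the index, so here I set project_slug, the column from the question, as the index; the column name and values are just illustrative):

```python
import pandas as pd
import numpy as np

# First dataframe: row 'b' has a missing value
df_1 = pd.DataFrame({'project_slug': ['a', 'b'],
                     'value': [1.0, np.nan]}).set_index('project_slug')

# Second dataframe: fills the gap for 'b' and adds a new row 'c'
df_2 = pd.DataFrame({'project_slug': ['b', 'c'],
                     'value': [2.0, 3.0]}).set_index('project_slug')

combined = df_1.combine_first(df_2)
# 'b' gets its NaN filled from df_2, and the new row 'c' is appended
print(combined)
```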

fabmeyer