
I have two rather large Pandas dataframes (~300k rows each): one with N ~ 30 columns (including a "Description" column), the other with only a "Description" column. The first contains the full data but has a few thousand extra rows and roughly 600 missing ones, both roughly uniformly spread throughout; the second correctly specifies the rows but carries no further data. I would like to merge them into a single dataframe that has the rows specified by the latter and the data of the former (i.e. dropping rows that do not appear in the latter and creating new rows where needed, say filled with NaNs). So the final dataframe should have shape M x N, where M is the number of rows of the second dataframe.

This would be a prototypical case for pandas.merge, were it not for the fact that the keys I want to merge on have duplicates. They are "mostly unique": some entries repeat, but typically no closer than 100 lines apart, and I don't want them to appear multiple times. Instead, I would like something that works like version control: compare the two "Description" columns line by line to find the best match, remove the rows added in the first dataframe, and insert the new rows from the second. Indeed, if I extract just the "Description" columns of both dataframes into two CSVs, I can do what I need by hand with e.g. meld (the lines are long enough that there is really no ambiguity in matching equal lines), but 1) this is slow and 2) it doesn't solve my problem (in the end I would just get a duplicate of the second dataframe, since I had to drop the other columns to use meld).

To exemplify, say that I have the following two dataframes:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Description': ['A','B','Y','D','A','E','F','Y','B'], 'Values': np.arange(9)**2})
df2 = pd.DataFrame({'Description': ['A','B','D','A','E','B','F','B']})

>>> df1
  Description  Values
0           A       0
1           B       1
2           Y       4
3           D       9
4           A      16
5           E      25
6           F      36
7           Y      49
8           B      64
>>> df2
  Description
0           A
1           B
2           D
3           A
4           E
5           B
6           F
7           B

The one I would like to obtain is:

  Description  Values
0           A     0.0
1           B     1.0
2           D     9.0
3           A    16.0
4           E    25.0
5           B     NaN
6           F    36.0
7           B    64.0

The only kind-of-solution I have found so far is to add to the first dataframe a column that simply replicates the index:

df1['id'] = df1.index

(which I know equals the default index, i.e. np.arange(len(df1))), then use pd.merge(df1, df2, how='right'), and finally extract from the merged dataframe the rows whose id values form a maximal increasing sequence. For that last step see e.g. Longest increasing subsequence, which would need to be adapted so that any NaN is always kept in the sequence. A rough sketch of this idea follows.
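To make that concrete, here is a rough sketch of this merge-plus-increasing-subsequence idea, entirely my own and only checked against the toy example above (the names cand, seq and match are mine). It keeps, among all candidate (df2 row, df1 row) pairs produced by the merge, a longest set that is strictly increasing in both row orders:

import bisect

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Description': ['A','B','Y','D','A','E','F','Y','B'],
                    'Values': np.arange(9)**2})
df2 = pd.DataFrame({'Description': ['A','B','D','A','E','B','F','B']})

# Tag df1 rows with their position, then list every candidate match per df2 row.
df1 = df1.assign(id=df1.index)
cand = df2.reset_index().merge(df1, how='left', on='Description')

# Sort by df2 position ascending and df1 position descending, so that a
# strictly increasing subsequence of 'id' picks at most one match per df2 row.
seq = cand.dropna(subset=['id']).sort_values(['index', 'id'],
                                             ascending=[True, False])
ids = seq['id'].tolist()

# Longest strictly increasing subsequence with parent pointers (O(K log K)).
tails_vals, tails_pos = [], []
parent = [-1] * len(ids)
for p, v in enumerate(ids):
    k = bisect.bisect_left(tails_vals, v)
    if k == len(tails_vals):
        tails_vals.append(v)
        tails_pos.append(p)
    else:
        tails_vals[k] = v
        tails_pos[k] = p
    parent[p] = tails_pos[k - 1] if k > 0 else -1

# Walk the pointers back to recover the chosen (df2 row -> df1 row) matches.
p = tails_pos[-1]
match = {}
while p != -1:
    match[int(seq['index'].iloc[p])] = int(seq['id'].iloc[p])
    p = parent[p]

result = df2.copy()
result['Values'] = [df1['Values'].iloc[match[j]] if j in match else np.nan
                    for j in range(len(df2))]
print(result)

On the toy example this reproduces the desired output above, including the NaN for the unmatched B; I have not thought through whether ties should ever be broken differently on real data.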

However, before reinventing the wheel, I was wondering whether somebody knows of an existing implementation of this, perhaps in some package used for automated version control?
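One candidate I am aware of is difflib from the Python standard library, which implements diff-style alignment of two sequences (the same kind of matching meld displays). A rough sketch of how it might apply here, again only checked against the toy example above (autojunk is disabled because on hundreds of thousands of lines repeated entries would otherwise be treated as junk):

import difflib

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Description': ['A','B','Y','D','A','E','F','Y','B'],
                    'Values': np.arange(9)**2})
df2 = pd.DataFrame({'Description': ['A','B','D','A','E','B','F','B']})

sm = difflib.SequenceMatcher(a=df1['Description'].tolist(),
                             b=df2['Description'].tolist(),
                             autojunk=False)

# Map each df2 position to the df1 position it is aligned with, if any.
match = {}
for block in sm.get_matching_blocks():
    for k in range(block.size):
        match[block.b + k] = block.a + k

result = df2.copy()
result['Values'] = [df1['Values'].iloc[match[j]] if j in match else np.nan
                    for j in range(len(df2))]
print(result)

Whether its recursive longest-block heuristic behaves well on 300k mostly-unique lines is exactly what I cannot judge, hence the question.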

EDIT: in a previous version of the question, df2 was df2 = pd.DataFrame({'Description': ['A','B','D','A','E','Z','F','B']}). Reading the first answer by @jezrael, I realized that the letter Z suggested the additional lines are somehow "special", whereas in practice their content also appears elsewhere in the dataframes. This led @jezrael to a very nice and elegant answer that unfortunately does not work for my case.

On the other hand, as a slightly simplifying hypothesis: in practice the rows to drop are rather special, so I renamed them to 'Y', since I could remove them beforehand and ignore that part of the question; the "increasing" part is what I really care about.

Marco Spinaci

1 Answer


I think you need to create a new helper column that distinguishes the duplicates, using cumcount:

# Number the occurrences of each Description, so repeated keys become unique pairs
df1['a'] = df1.groupby('Description').cumcount()
df2['a'] = df2.groupby('Description').cumcount()

print (df1)
  Description  Values  a
0           A       0  0
1           B       1  0
2           C       4  0
3           D       9  0
4           A      16  1
5           E      25  0
6           F      36  0
7           A      49  2
8           B      64  1
print (df2)
  Description  a
0           A  0
1           B  0
2           D  0
3           A  1
4           E  0
5           Z  0
6           F  0
7           B  1

print (pd.merge(df1, df2, how='right', on=['Description','a']))
      Description  Values  a
0           A     0.0  0
1           B     1.0  0
2           D     9.0  0
3           A    16.0  1
4           E    25.0  0
5           F    36.0  0
6           B    64.0  1
7           Z     NaN  0

If order is important, add reset_index and sort_values, and finally drop the helper columns:

df = pd.merge(df1, df2.reset_index(), how='right', on=['Description','a'])
df = df.sort_values('index').drop(['a','index'], axis=1)
print (df)
      Description  Values
0           A     0.0
1           B     1.0
2           D     9.0
3           A    16.0
4           E    25.0
7           Z     NaN
5           F    36.0
6           B    64.0
jezrael
  • Thanks for the quick answer! Unfortunately, even though this solves the example I wrote, I fear the general case has one more difficulty that makes this solution fail: the additional rows in df2 may well already appear somewhere else (in df1 and df2), e.g. df2 = pd.DataFrame({'Description': ['A','B','D','A','E','C','F','B']}). In this case, your solution would give that line the C value instead of NaN (and if there are other C's further down, it will break all the matches, as the cumcount() differs from there on; see the minimal reproduction after these comments). – Marco Spinaci Mar 07 '17 at 12:35
  • :( `merge` is not a friend of `duplicated` values; I have no other idea right now. Can you change the sample in the question to the values in your comment? – jezrael Mar 07 '17 at 12:39
  • Edited the question, thanks! And also thanks a lot for the answer anyway, I learned about cumcount() today :-) – Marco Spinaci Mar 07 '17 at 12:47
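To make the failure mode from the first comment concrete, here is a minimal reproduction (my own sketch, using the answer's original df1 and the df2 variant from the comment):

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Description': ['A','B','C','D','A','E','F','A','B'],
                    'Values': np.arange(9)**2})
df2 = pd.DataFrame({'Description': ['A','B','D','A','E','C','F','B']})

df1['a'] = df1.groupby('Description').cumcount()
df2['a'] = df2.groupby('Description').cumcount()

# The extra 'C' in df2 pairs with df1's unrelated 'C' (both have cumcount 0),
# so it receives Values 4.0 instead of the NaN the question asks for.
print(pd.merge(df1, df2, how='right', on=['Description', 'a']))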