1

i have 2 data frames which are group and cite

assume that this is my df_group

label groupId 
1       123
2       124
3       125
4       126
5       127

and df_cite

groupId new_group
123       96
124       96
125       96
123       97
124       99
124       98
125       98
126       97
127       99

i would like to see the new df_group results as

df_group (new)

label groudId new_group
1      123     96
2      123     97
3      124     96
4      124     98
5      124     99
6      125     96
7      125     98
8      126     97
9      127     99

i have tried test_out = df_group.merge(df_cite, left_on='groupId', right_on='groupId') and df_group = df_group.join(df_cite.set_index('groupId'), on=['PatNumgroupId']) but both are not working.

further to this Python: how to merge two dataframes on a column by keeping the information of the first one?, i have followed but got the InvalidIndexError: Reindexing only valid with uniquely valued Index objects instead

  • Not sure your output makes sense in the question - why does `label` 4 now have `groudId` 124 for example? – Toby Petty Oct 30 '18 at 13:30
  • actually, the label is an index. – kang xiuriu Oct 30 '18 at 13:32
  • In that case what's the reason for wanting to merge, since `df_group` doesn't contain any additional information? What is it you're trying to achieve? – Toby Petty Oct 30 '18 at 13:35
  • i would like to merge the new_group from df_cite to the df_group to know the citation in df_group – kang xiuriu Oct 30 '18 at 13:37
  • The problem is that you can not have duplicated values in the index. Your label column will have duplicates after the merge. Try this: df_group.merge(df_cite, on ='groupId', how = 'outer') You have the same column nanme in both df therefore no need to use right and left on. You need the outer option to match all values. Do not try to set the new index on label. – Jorge Oct 30 '18 at 13:41
  • thank you for all suggestions, i also have tried df_group.merge(df_cite, on ='groupId', how = 'outer') this but it appears 'ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat'. I am a new beginner user and don't know how to fix it :( – kang xiuriu Oct 30 '18 at 13:44

2 Answers2

1

I believe you need to create helper columns with cumcount for counter of duplicated values and merge with left join by columns groupId with g, last remove helper column by drop:

For merge need same types of merging columns, so is possible convert both to integers or both to strings

#solution 1 
df_group['groupId'] = df_group['groupId'].astype(int)
#solution 2
#df_cite['groupId'] = df_cite['groupId'].astype(str)

df_group['g'] = df_group.groupby('groupId').cumcount()
df_cite['g'] = df_cite.groupby('groupId').cumcount()

test_out = df_group.merge(df_cite, on=['groupId','g'], how='left').drop('g', axis=1)
print (test_out)
   label  groupId  new_group
0      1      123         96
1      2      124         96
2      3      125         96
3      4      126         97
4      5      127         99
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • thank you so much for your suggestion. but the value error suggest me to use pd.concat T T – kang xiuriu Oct 30 '18 at 13:47
  • @kangxiuriu - Answer was edited, how working `df_group['groupId'] = df_group['groupId'].astype(int)` before my solution ? – jezrael Oct 30 '18 at 13:51
  • Be aware with this response you are missing values in label. It needs to be 'outer' in the how option to include all values from both dataframes. – Jorge Oct 30 '18 at 14:06
  • @Jorge - Can you explain more? Because I think not. – jezrael Oct 30 '18 at 14:07
  • In df_cite you have several values in new_group for the same group_id (e.g. group_id 123 has new_group 96 and 97). by doing how = 'left' you only get one of those. In your response only 96. If you do how = 'outer' you get both values for new_group. If that is what @Kang xiuriu wants. – Jorge Oct 30 '18 at 14:09
  • @Jorge - you are right, if not use `cumcount` for new columns `g` for both df like my answer. If use it, no problem. – jezrael Oct 30 '18 at 14:11
0

You probably want to do:

df_cite = df_cit.reset_index(drop = False)

and

df_group = df_group.reset_index(drop = False)

to set a new index on each dataframe. It is not clear from your question whether the dfs have a 'regular' index or if you already set the index to one of the columns.

If is the second case, when you do the merge, the command is not finding the column, because it is the index.

This is how your dataframe with a 'normal' index should look like:

    label   groupId
0   1   123
1   2   124
2   3   125
3   4   126
4   5   127

When compared to your question, the above df has an 'extra' column to the left. That is the index. In your case it appears that 'label' is the name of your index not a column in df_group.

It also seems that you may have your groupId as different types (object and integer -int)) in each df. You want to check that with df_cite.info() and df_group.info(). If they are columns, they should show up in a list and both should have the same type of data:

    df_cite.info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 9 entries, 0 to 8
    Data columns (total 2 columns):
    groupId      9 non-null int64
    new_group    9 non-null int64
    dtypes: int64(2)
    memory usage: 224.0 bytes

In this case groupId is an integer (int64)

Jorge
  • 2,181
  • 1
  • 19
  • 30