1

I cannot get this to come out as I expect, despite trying different variations of merging.

I would like to merge df's piat_comp and piat2 on piat_retest_id_2 but the following code, also tried as an inner join, does not line up as expected.

    piat_comp2 = piat_comp.merge(piat2, on='piat_retest_id_2', how='outer')

For instance, I am expecting index 8 of piat_comp to pair with index 9 of piat 2 because they both share the same id in 'piat_retest_id_2'. Instead they simply do not line up.

In the resultant dataframe those rows should merge to give:

piat_retest_id_1 ----- PIAT.ability ----- piat_retest_id_2 ----- PIAT.ability_2

UK_543970282 ----- 0.270247832130706 ----- UK_358142873 -----0.697223344529081

Notice that piat_retest_id_1 and piat_retest_id_2 are not expected to be the same. piat_retest_id_2 is in both df's and I want to merge on that, but only end up with one column of it.

Reproducible:

piat_comp = {'piat_retest_id_1': {0: 'UK_982598362',
  1: 'UK_554290758',
  2: 'UK_416473256',
  3: 'UK_717765782',
  4: 'UK_340767839',
  5: 'UK_393447950',
  6: 'UK_16532112',
  7: 'UK_887499021',
  8: 'UK_543970282',
  9: 'UK_86558749',
  10: 'UK_763524616',
  11: 'UK_59234772',
  12: 'UK_891870396',
  13: 'UK_416946337',
  14: 'UK_208235437',
  15: 'UK_697728998',
  16: 'UK_283032428',
  17: 'UK_927006735',
  18: 'UK_838548458',
  19: 'UK_935961427',
  20: 'UK_534651143',
  21: 'UK_469442638',
  22: 'UK_357516555',
  23: 'UK_869226039',
  24: 'UK_675587892',
  25: 'UK_331843371',
  26: 'UK_570264371',
  27: 'UK_818769311',
  28: 'UK_708776594',
  29: 'UK_811947376',
  30: 'AUS_182800035',
  31: 'AUS_436205218',
  32: 'AUS_519722760',
  33: 'AUS_948860279',
  34: 'AUS_273917058',
  35: 'AUS_174076947',
  36: 'AUS_462744594',
  37: 'AUS_718809552',
  38: 'AUS_142275638',
  39: 'AUS_646335936',
  40: 'AUS_775355864',
  41: 'AUS_523210619',
  42: 'AUS_242767995',
  43: 'AUS_485797627',
  44: 'AUS_192289763',
  45: 'AUS_971858664',
  46: 'AUS_272182103',
  47: 'AUS_6470694',
  48: 'AUS_960651008',
  49: 'AUS_36924588',
  50: 'AUS_921547727',
  51: 'AUS_469946434',
  52: 'AUS_591613124',
  53: 'AUS_575220499',
  54: 'AUS_767535190',
  55: 'AUS_6760645',
  56: 'AUS_33323414',
  57: 'AUS_453834043',
  58: 'AUS_576415932',
  59: 'AUS_359646415',
  60: 'AUS_647631159',
  61: 'AUS_535492855',
  62: 'AUS_488064434',
  63: 'AUS_230122761',
  64: 'AUS_996001169',
  65: 'AUS_830133002',
  66: 'AUS_82508132',
  67: 'AUS_922040010',
  68: 'AUS_754682968',
  69: 'AUS_79125001',
  70: 'AUS_581895146',
  71: 'AUS_100799448'},
 'PIAT.ability': {0: 3.8309313391931497,
  1: 1.99292247701631,
  2: 3.94306748097956,
  3: -0.19562132382300002,
  4: 0.6505183735635122,
  5: 0.731816086976127,
  6: 0.8137871489410059,
  7: 3.22777961014248,
  8: 0.270247832130706,
  9: 1.3424922846183798,
  10: -0.301396636649614,
  11: -1.9332907828737598,
  12: 0.917831930881864,
  13: 0.12488682346195801,
  14: 0.357263190639869,
  15: -0.771073750440309,
  16: 0.947225971482024,
  17: 1.09754260961856,
  18: 3.89556727394303,
  19: 0.000776309348843904,
  20: 0.633059075095001,
  21: -0.7613018941029259,
  22: 0.35232544411647604,
  23: 1.12508794021388,
  24: -0.689808173864224,
  25: 1.5182029956795,
  26: 0.8041551225465369,
  27: -0.49821628662950396,
  28: 1.0916595064983101,
  29: 3.6305272212891,
  30: -0.941759632235263,
  31: 0.8281845827398409,
  32: 0.38229348622269504,
  33: 1.53338336352128,
  34: 0.41351872726046995,
  35: 1.5705599264116397,
  36: 1.97524137560666,
  37: -0.00901326645847487,
  38: -0.6601342995997729,
  39: 0.764888906047256,
  40: 1.5069746380603897,
  41: 2.7867287535613703,
  42: -0.0572566797274714,
  43: 1.18340947496768,
  44: 0.21109444697996899,
  45: -0.11442018948849901,
  46: 0.260924655953102,
  47: 0.20230025184467001,
  48: -0.3911319281544,
  49: 1.66298720615184,
  50: 2.0562478076601596,
  51: -0.802324601478322,
  52: 0.0949041456718715,
  53: 0.39139573420752705,
  54: 0.0536255839392736,
  55: 1.32571209008263,
  56: 0.532393162267327,
  57: 0.269601175227049,
  58: 0.7076081560697621,
  59: 0.46112169587998597,
  60: -0.403030461213409,
  61: 0.43675734847674497,
  62: -0.879126649374346,
  63: -0.421994389129437,
  64: 1.77097042768753,
  65: -0.9110846702038922,
  66: -1.07482469386456,
  67: 3.7599324868259596,
  68: 0.359258100506062,
  69: 3.63222036081698,
  70: 0.389727330252132,
  71: 0.656178769811206},
 'piat_retest_id_2': {0: 'UK_873131581',
  1: 'UK_536050247',
  2: 'UK_288912',
  3: 'UK_601549101',
  4: 'UK_654724631',
  5: 'UK_927459150',
  6: 'UK_161992453',
  7: 'UK_898365573',
  8: 'UK_358142873',
  9: 'UK_749339291',
  10: 'UK_137508979',
  11: 'UK_802078996',
  12: 'UK_418768387',
  13: 'UK_538152168',
  14: 'UK_825889086.',
  15: 'UK_56800313',
  16: 'UK_90101007',
  17: 'UK_77933368',
  18: 'UK_288469917',
  19: 'UK_908770966',
  20: 'UK_661264755',
  21: 'UK_154114403',
  22: 'UK_826457452',
  23: 'UK_626934971',
  24: 'UK_852937537',
  25: 'UK_32221105',
  26: 'UK_516899485',
  27: nan,
  28: 'UK_213817864',
  29: 'UK_112922227',
  30: 'AUS_1828000352',
  31: 'AUS_4362052182',
  32: 'AUS_5197227602',
  33: 'AUS_9488602792',
  34: 'AUS_2739170582',
  35: 'AUS_1740769472',
  36: 'AUS_4627445942',
  37: 'AUS_7188095522',
  38: 'AUS_1422756382',
  39: 'AUS_6463359362',
  40: 'AUS_7753558642',
  41: nan,
  42: 'AUS_4857976272',
  43: 'AUS_1922897632',
  44: 'AUS_9718586642',
  45: 'AUS_2721821032',
  46: 'AUS_64706942',
  47: 'AUS_9606510082',
  48: nan,
  49: 'AUS_9215477272',
  50: 'AUS_5916131242',
  51: 'AUS_5752204992',
  52: 'AUS_7675351902',
  53: 'AUS_67606452',
  54: 'AUS_333234142',
  55: 'AUS_4538340432',
  56: 'AUS_5764159322',
  57: 'AUS_3596464152',
  58: 'AUS_6476311592',
  59: 'AUS_5354928552',
  60: 'AUS_4880644342',
  61: 'AUS_2301227612',
  62: 'AUS_9960011692',
  63: 'AUS_8301330022',
  64: 'AUS_825081322',
  65: 'AUS_9220400102',
  66: 'AUS_7546829682',
  67: 'AUS_791250012',
  68: 'AUS_5818951462',
  69: 'AUS_1007994482',
  70: nan,
  71: nan}}

  piat_comp = pd.DataFrame().from_dict(piat_comp)

    piat2 = {'piat_retest_id_2': {0: 'UK_873131581',
  1: 'UK_536050247',
  2: 'UK_288912',
  3: 'UK_601549101',
  4: 'UK_654724631',
  5: 'UK_927459150',
  6: 'UK_605723093',
  7: 'UK_161992453',
  8: 'UK_898365573',
  9: 'UK_358142873',
  10: 'UK_749339291',
  11: 'UK_137508979',
  12: 'UK_802078996',
  13: 'UK_418768387',
  14: 'UK_538152168',
  15: 'UK_56800313',
  16: 'UK_90101007',
  17: 'UK_77933368',
  18: 'UK_288469917',
  19: 'UK_908770966',
  20: 'UK_661264755',
  21: 'UK_154114403',
  22: 'UK_826457452',
  23: 'UK_852937537',
  24: 'UK_32221105',
  25: 'UK_516899485',
  26: 'UK_213817864',
  27: 'UK_112922227',
  28: 'AUS_1828000352',
  29: 'AUS_4362052182',
  30: 'AUS_5197227602',
  31: 'AUS_5782596212',
  32: 'AUS_9488602792',
  33: 'AUS_2739170582',
  34: 'AUS_1740769472',
  35: 'AUS_4627445942',
  36: 'AUS_7188095522',
  37: 'AUS_1422756382',
  38: 'AUS_6463359362',
  39: 'AUS_7753558642',
  40: 'AUS_4857976272',
  41: 'AUS_1922897632',
  42: 'AUS_9718586642',
  43: 'AUS_2721821032',
  44: 'AUS_64706942',
  45: 'AUS_9606510082',
  46: 'AUS_9215477272',
  47: 'AUS_4699464342',
  48: 'AUS_5916131242',
  49: 'AUS_5752204992',
  50: 'AUS_7675351902',
  51: 'AUS_67606452'},
 'PIAT.ability': {0: 4.0,
  1: 3.03585323865227,
  2: 3.0659628668283796,
  3: -0.984877273151229,
  4: 3.9214698118886804,
  5: 2.51547803673487,
  6: 1.1199284033743502,
  7: 0.776919848266456,
  8: 1.27559145269151,
  9: 0.697223344529081,
  10: 1.20628362697982,
  11: 1.3367662330669798,
  12: -0.862955270810047,
  13: 3.8621560811803697,
  14: 0.0635995324043882,
  15: -0.5886718197915529,
  16: -1.24297429168435,
  17: 3.5438183255585898,
  18: 4.0,
  19: 1.11913257500304,
  20: 0.7775141672201971,
  21: 1.1006024338623002,
  22: -0.24300164716209502,
  23: -0.494374704856605,
  24: 2.85613759713724,
  25: 1.3029798990839299,
  26: 0.0710404544671652,
  27: 4.0,
  28: 0.902265996923405,
  29: 1.4454497071397798,
  30: -1.2020915721021102,
  31: -1.8592336815873403,
  32: 1.61547349789574,
  33: 0.26694017591396,
  34: 2.44205760912957,
  35: 0.6621194092522099,
  36: -0.9020563539838921,
  37: 2.15717946932124,
  38: 1.16781892225373,
  39: 1.20554720958442,
  40: 1.7657674979533402,
  41: 0.40870760302225895,
  42: 0.0479209217874616,
  43: 1.6004708092039999,
  44: 0.19277544914143,
  45: 2.06504675907639,
  46: 2.95312483834473,
  47: -0.870384004997555,
  48: 0.7775511527014491,
  49: 1.3847884057742,
  50: 0.12852600156989802,
  51: 1.15655034425564}}

piat2 = pd.DataFrame().from_dict(piat2)
syntheso
  • 437
  • 3
  • 17
  • Please post the output you're getting in your question, as I am unable to reproduce. When I run the code as you posted it, I get the desired output of `8 UK_543970282 0.270248 UK_358142873 0.697223` – G. Anderson May 28 '19 at 16:23
  • Okay, it actually happens later in the dataset so I posted the whole thing now, but that example still demonstrates what I would like to happen in principle. You can see where it hasn't worked with the AUS_ ids which are the exact same ID for piat_retest_id_1 and piat_retest_id_2 but with a 2 at the end. So AUS_1234567 should end up next to AUS_12345672 (not actually in the dataset, but as an example) – syntheso May 28 '19 at 16:33
  • Then this is an [xy problem](https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). The issue isn't with the merge, it's with the fact that the IDs are different. Merge happens on exact matches. If you want to merge on similar but not exact matches, then there is a different question to ask. – G. Anderson May 28 '19 at 16:58
  • No you are misunderstanding me. They are exact matches, I was just pointing out that you can check quickly if the merge has worked correctly because the AUS_ strings will be the same except the second ID will have a 2 at the end. – syntheso May 28 '19 at 17:05
  • It's still not clear where the issue lies. If I do a merge on the ID as described, I get, for example, `AUS_33323414 0.532393 AUS_5764159322 NaN` and `AUS_767535190 0.053626 AUS_333234142 NaN`, because those test_id_1 values are associated with those test_id_2 values, and are `NaN` because neither of them have test_id_2 values that exist in `piat2`. I also get `AUS_775355864 1.506975 AUS_7753558642 1.205547`, because that has an entry in both tables for both `ability` values. Which of these entries is problematic? – G. Anderson May 28 '19 at 17:28
  • Yes, I too get that. And those are fine. But then I also get AUS_192289763 0.211094 with AUS_9718586642 0.0479209, but that should be paired with AUS_1922897632 0.408708 since they both had the same piat_retest_id_2. – syntheso May 28 '19 at 17:45
  • That's the point I'm trying to make. This seems like an issue with the _underlying data_, not the _merge process_. In your provided df, the respective entries for those data points are `43 AUS_485797627 1.183409 AUS_1922897632` and `44 AUS_192289763 0.211094 AUS_9718586642`. Therefore, when you merge on that `id_2` column, it's doing exactly what you're telling it to do. If you fix whatever underlying process is mapping `id_1` to the apparently incorrect `id_2`, then the merge issue will go away – G. Anderson May 28 '19 at 17:52
  • 1
    Ah, bingo! Thank you so much, you are absolutely right. I assumed it was my code and was not looking in the right place. All resolved : ) – syntheso May 28 '19 at 18:09

1 Answers1

1

Use left_on and right_on in order to merge because you have different keys in each dataframe.

piat_comp2 = piat_comp.merge(
    piat2, 
    left_on='piat_retest_id_1',
    right_on='piat_retest_id_2', 
    how='outer'
)
iamchoosinganame
  • 1,090
  • 6
  • 15