
I am trying to left join two data sets based on a four-digit code in each. One data set has the codes filled in to varying degrees (2, 3, or all 4 digits), padded with trailing zeroes as needed.

The other data set has the codes completed to all four digits.

If the last two digits of CodeA are 00 then I want to join to any CodeB with the same first two digits. If only the last digit of CodeA is 0 then I want to join to all CodeBs that have the same first three digits. If CodeA has all four digits then I want to join to those exact same codes in CodeB.
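The rule above boils down to stripping the padding zeroes from CodeA and doing a prefix match against CodeB. A minimal sketch of that rule (assuming the codes are stored as strings and 0 appears only as padding, never as a meaningful final digit):

```python
def match_key(code_a: str) -> str:
    # Strip padding zeroes; what remains is the prefix a CodeB must start with.
    # '2500' -> '25', '4110' -> '411', '5345' -> '5345' (exact match).
    return code_a.rstrip('0')

def codes_match(code_a: str, code_b: str) -> bool:
    return code_b.startswith(match_key(code_a))
```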

Example:

CodeA data set

Example CodeA   Field1
1   2500    w
2   4110    x
3   2525    y
4   5345    z



CodeB data set
CodeB   Field2
1234    a
2525    b
4113    c
6543    d
5341    e
2522    f
4122    g
5345    h

I want my result data set to look like this:

Ex  CodeA   Field1  CodeB   Field2
1   2500    w   2525    b
1   2500    w   2522    f
2   4110    x   4113    c
3   2525    y   2525    b
4   5345    z   5345    h
user3242036

2 Answers


If you know that all of your codes are 4 digits and that 0 is only used as padding (never as a meaningful final digit), you can split the problem into 3 joins: one on full 4-digit codes, one on 3-digit prefixes, and one on 2-digit prefixes. First subset the CodeA data set into rows with no trailing zeroes, one trailing zero, and two trailing zeroes. Then build key columns from CodeA and CodeB that are 4, 3, and 2 digits long, respectively. Do the three joins separately and concatenate the results. An example is below. (The data is not exactly the same as yours.)

import pandas as pd

df = pd.DataFrame(data=[[2500, 1],
                        [2525, 2],
                        [5345, 3],
                        [2520, 4]],
                  columns=['CodeA', 'Field1'])
df_2 = pd.DataFrame(data=[[2525, 1],
                          [2532, 2],
                          [5345, 2]],
                    columns=['CodeB', 'Field2'])

# work with the codes as strings so prefixes can be sliced
df['CodeA'] = df['CodeA'].astype(str)
df_2['CodeB'] = df_2['CodeB'].astype(str)

# 2- and 3-digit prefix keys for CodeB
df_2['CodeB2'] = df_2['CodeB'].str.slice(0, 2)
df_2['CodeB3'] = df_2['CodeB'].str.slice(0, 3)

# 2- and 3-digit prefix keys for CodeA
df['CodeA2'] = df['CodeA'].str.slice(0, 2)
df['CodeA3'] = df['CodeA'].str.slice(0, 3)

# flag rows padded with two zeroes (join on 2 digits) or one zero (join on 3)
df['Use2'] = df['CodeA'].str.slice(2) == '00'
df['Use3'] = (df['CodeA'].str.slice(3) == '0') & (~df['Use2'])

# subset CodeA rows by how many digits are meaningful
df_use_all = df[(~df['Use2']) & (~df['Use3'])]
df_use_2 = df[df['Use2']]
df_use_3 = df[df['Use3']]

# three separate left joins, then concatenate
df_use_all = df_use_all.merge(df_2, left_on=['CodeA'], right_on=['CodeB'], how='left')
df_use_2 = df_use_2.merge(df_2, left_on=['CodeA2'], right_on=['CodeB2'], how='left')
df_use_3 = df_use_3.merge(df_2, left_on=['CodeA3'], right_on=['CodeB3'], how='left')

df_all = pd.concat([df_use_all, df_use_2, df_use_3])
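The concatenated frame still carries the helper key and flag columns. If those are not wanted in the final result, they can be dropped after the concat; a minimal sketch with a toy stand-in frame (`combined` mirrors the column layout produced above, but the data is illustrative):

```python
import pandas as pd

# toy stand-in for the concatenated result, with the same helper columns
combined = pd.DataFrame({'CodeA': ['2500'], 'Field1': ['w'],
                         'CodeA2': ['25'], 'CodeA3': ['250'],
                         'Use2': [True], 'Use3': [False],
                         'CodeB': ['2525'], 'Field2': ['b'],
                         'CodeB2': ['25'], 'CodeB3': ['252']})

# drop the helper key/flag columns once the joins are done
helpers = ['CodeA2', 'CodeA3', 'Use2', 'Use3', 'CodeB2', 'CodeB3']
combined = combined.drop(columns=helpers)
```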
scomes

This is one solution via collections.defaultdict. The triple-merge method (as detailed by @scomes) is the conventional algorithm, but you may find this one easier to maintain.

In the solution below, df1 is the CodeA data set and df2 is the CodeB data set.

from collections import defaultdict
from itertools import chain

import numpy as np
import pandas as pd

d = defaultdict(list)

# create full set from df2 and test set from df1
full_set = list(zip(df2['CodeB'].astype(str), df2['Field2']))
test_set = set(df1['CodeA'].astype(str).str.rstrip('0'))

# create dictionary of mapping (loopy)
for k in test_set:
    d[k].extend([i for i in full_set if i[0].startswith(k)])

# map dictionary
df1['Mapped'] = df1['CodeA'].astype(str).str.rstrip('0').map(d)

# expand column of list of tuples
lens = list(map(len, df1['Mapped']))

res = pd.DataFrame({'Example': np.repeat(df1['Example'], lens),
                    'CodeA': np.repeat(df1['CodeA'], lens),
                    'Field1': np.repeat(df1['Field1'], lens),
                    'Mapped': list(chain.from_iterable(df1['Mapped']))})

# split column of tuples
res[['CodeB', 'Field2']] = res['Mapped'].apply(pd.Series)

# drop helper column
res = res.drop(columns='Mapped')

Result

   CodeA  Example Field1 CodeB Field2
0   2500        1      w  2525      b
0   2500        1      w  2522      f
1   4110        2      x  4113      c
2   2525        3      y  2525      b
3   5345        4      z  5345      h
jpp
  • In trying to implement this I'm curious what to do if I actually have a large number of other columns in each data set that need to be carried over without listing them all out? – user3242036 Mar 15 '18 at 14:45
  • There are many ways to split a column of lists into multiple rows, see [1](https://stackoverflow.com/questions/39011511/pandas-expand-rows-from-list-data-available-in-column), [2](https://stackoverflow.com/questions/27263805/pandas-when-cell-contents-are-lists-create-a-row-for-each-element-in-the-list), [3](https://stackoverflow.com/questions/32468402/how-to-explode-a-list-inside-a-dataframe-cell-into-separate-rows/32470490), and many more. You can choose a method that works for you. – jpp Mar 15 '18 at 14:50
  • What I meant was if I have df2['Field3'], df2['Field4'], etc, etc and df1['Field5'], df1['Field6'], etc, etc. Do I have to list all these out in order to maintain them in the merge? – user3242036 Mar 15 '18 at 14:56
  • In this method, yes, but this bit of the code can be replaced by many other algorithms (I provided links to them in my previous comment). Really what we are doing is splitting a column of lists into rows, which has many different solutions. – jpp Mar 15 '18 at 14:57
  • Also, if you're comfortable with dictionaries, you could use a dictionary comprehension to construct a dictionary of values, e.g. `{i: np.repeat(df[i], lens) for i in ('a', 'b', 'c')}`. – jpp Mar 15 '18 at 14:59
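The dict-comprehension idea from the last comment can repeat every column of df1 without naming each one by iterating over `df1.columns`. A minimal sketch with made-up data (`Field5` is an illustrative extra column, not from the original frames):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'CodeA': ['2500', '4110'],
                    'Field1': ['w', 'x'],
                    'Field5': [10, 20]})

# per-row lists of matches, as produced by the mapping step
mapped = [[('2525', 'b'), ('2522', 'f')], [('4113', 'c')]]
lens = [len(m) for m in mapped]

# repeat all columns at once instead of listing each one
res = pd.DataFrame({col: np.repeat(df1[col].to_numpy(), lens)
                    for col in df1.columns})
```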