1

Having the following Python code, was trying to use pd.merge but seems that key columns requires to be identical. Trying to to something similar to SQL join with "like" operator from df.B with categories.Pattern.

UPDATE with better data example.

import pandas as pd
import numpy as np
df = pd.DataFrame([[1, 'Gas Station'], [2, 'Servicenter'], [5, 'Bakery good bread'], [58, 'Fresh market MIA'], [76, 'Auto Liberty aa1121']], columns=['A','B'])

    Out[12]:
    A   B
0   1   Gas Station
1   2   Servicenter
2   5   Bakery good bread
3   58  Fresh market MIA
4   76  Auto Liberty aa1121

categories = pd.DataFrame([['Gasoline', 'Gas Station'], ['Gasoline', 'Servicenter'], ['Food', 'Bakery'],  ['Food', 'Fresh market'], ['Insurance', 'Auto Liberty']], columns=['Category','Pattern'])

    Out[13]:
    Category    Pattern
0   Gasoline    Gas Station
1   Gasoline    Servicenter
2   Food    Bakery
3   Food    Fresh market
4   Insurance   Auto Liberty

Expected result is:

    Out[14]:
    A   B                   Category
0   1   Gas Station         Gasoline
1   2   Servicenter         Gasoline
2   5   Bakery good bread   Food
3   58  Fresh market MIA    Food
4   58  Auto Liberty aa1121 Insurance

Appreciate your suggestions/feedback.

Luis R. Gonzalez
  • 358
  • 3
  • 16
  • Possible duplicate of [is it possible to do fuzzy match merge with python pandas?](https://stackoverflow.com/questions/13636848/is-it-possible-to-do-fuzzy-match-merge-with-python-pandas) – cs95 Jul 18 '17 at 23:10
  • We'd appreciate what you expect the outcome to be – piRSquared Jul 18 '17 at 23:15
  • the fuzzy match proposal seems do not apply here. I would like to use the categories.Pattern value and use it as we normally do with str.contains for example, but returns categories.Category value if matches. – Luis R. Gonzalez Jul 20 '17 at 02:33

2 Answers2

1

By Creating a new function like:

def lookup_table(value, df):
    """

    :param value: value to find the dataframe
    :param df: dataframe which constains the lookup table
    :return: 
        A String representing a the data found
    """
    # Variable Initialization for non found entry in list
    out = None
    list_items = df['Pattern'].tolist()
    for item in list_items:
        if item in value:
            out = item
            break
    return out

which will return a new value using a dataframe as look-up table and the parameter value

The following Complete example will show the expected dataframe.

import pandas as pd

df = pd.DataFrame([[1, 'Gas Station'], [2, 'Servicenter'], [5, 'Bakery good bread'], [58, 'Fresh market MIA'], [76, 'Auto Liberty aa1121']], columns=['A','B'])
categories = pd.DataFrame([['Gasoline', 'Gas Station'], ['Gasoline', 'Servicenter'], ['Food', 'Bakery'],  ['Food', 'Fresh market'], ['Insurance', 'Auto Liberty']], columns=['Category','Pattern'])

def lookup_table(value, df):
    """

    :param value: value to find the dataframe
    :param df: dataframe which constains the lookup table
    :return: 
        A String representing a the data found
    """
    # Variable Initialization for non found entry in list
    out = None
    list_items = df['Pattern'].tolist()
    for item in list_items:
        if item in value:
            out = item
            break
    return out


df['Pattern'] = df['B'].apply(lambda x: lookup_table(x, categories))
final = pd.merge(df, categories)

Expected output

0
df['lower'] = df['B'].str.extract(r'([A-z0-9]+)')
categories['lower'] = categories['pattern'].str.extract(r'([A-z0-9]+)')
final = pd.merge(df, categories)
A.Kot
  • 7,615
  • 2
  • 22
  • 24
  • what I would like to to use is categories.Pattern values as part of a regular expression or combination with wildcard to match what I found in df.B. If match, then return the categories.Category value. – Luis R. Gonzalez Jul 20 '17 at 02:30
  • @LuisGonzalez Will there only be three categories? – A.Kot Jul 20 '17 at 13:54
  • The categories dataframe is a lookup table, may vary time basis, including more categories and patterns. – Luis R. Gonzalez Jul 20 '17 at 14:00
  • What you're asking to be a lookup table doesn't make sense. There is a lot of ambiguity. How does abc123 get classified as Alpha and not Numeric. Numeric has 123 in it. – A.Kot Jul 20 '17 at 14:13
  • You are right, the example provided is ambiguous. I did proper update. Still not getting how to do it, maybe the way could be using regular expressions but I would like to have the categories in a dataframe or database table, having each entry unique. – Luis R. Gonzalez Jul 26 '17 at 21:45
  • Now you're facing a simple merge: `df.merge(categories, how='left', left_on=['B'], right_on=['Pattern']` – A.Kot Jul 27 '17 at 14:55
  • Not precisely, because the merge requires a exact match, then your merge suggestion will only works on those records having B and Pattern values. – Luis R. Gonzalez Jul 27 '17 at 18:10
  • SQL's basic like operator looks for a string segment within another to match it. Here's a quick way to do it: https://stackoverflow.com/questions/42698281/pandas-if-value-in-a-dataframe-contains-string-from-another-dataframe-append – A.Kot Jul 27 '17 at 18:29
  • Right, seems the suggested method is the fuzzy logic but does not fit as solution to me. – Luis R. Gonzalez Jul 27 '17 at 22:34