
I have two dataframes, both read from SQL tables.

This is my first dataframe:

Original_Input           Cleansed_Input        Core_Input    Type_input
TECHNOLOGIES S.A         TECHNOLOGIES SA        
A & J INDUSTRIES, LLC    A J INDUSTRIES LLC     
A&S DENTAL SERVICES      AS DENTAL SERVICES     
A.M.G Médicale Inc       AMG Mdicale Inc        
AAREN SCIENTIFIC         AAREN SCIENTIFIC   

My second dataframe is:

Name_Extension     Company_Type     Priority
co llc             Company LLC       2
Pvt ltd            Private Limited   8
Corp               Corporation       4
CO Ltd             Company Limited   3
inc                Incorporated      5
CO                 Company           1

I removed punctuation, non-ASCII characters and digits, and put the result in the Cleansed_Input column of df1.

That Cleansed_Input column in df1 needs to be checked against the Name_Extension column of df2. If a value from Cleansed_Input ends with any value from Name_Extension, that ending should be split off and put in the Type_input column of df1, and not as-is but expanded to its full form.

For example, if CO is present at the end of Cleansed_Input, it should be expanded to Company and put in the Type_input column, and the remaining text should go in the Core_Input column of df1. There is also a Priority given; I am not sure whether that is needed.

Expected output:

Original_Input          Cleansed_Input        Core_Input       Type_input
TECHNOLOGIES S.A        TECHNOLOGIES SA       TECHNOLOGIES      SA
A & J INDUSTRIES, LLC   A J INDUSTRIES LLC    A J INDUSTRIES    LLC
A&S DENTAL SERVICES     AS DENTAL SERVICES      
A.M.G Médicale Inc      AMG Mdicale Inc       AMG Mdicale       Incorporated
AAREN SCIENTIFIC        AAREN SCIENTIFIC        

I tried many methods like isin, mask, contains, etc., but I am not sure what to put where.

I got an error that said "Series are mutable, they cannot be hashed". I was unsure why I got that error while I was trying things with the dataframe.

I no longer have that code. I am using a Jupyter notebook and SQL Server, and isin doesn't seem to work in Jupyter.

In the same way, there is another split to be done: the Original_Input column is to be split into a parent company name and an alias name.

Here is my code:

import pyodbc
import pandas as pd
import string
from string import digits
import sqlalchemy
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from sqlalchemy.types import String
from io import StringIO
from itertools import chain
import re

#Connecting SQL with Python

server = '172.16.15.9'
database = 'Database Demo'
username = '**'
password = '******'


engine = create_engine('mssql+pyodbc://**:******@' + server + '/' + database + '?driver=SQL+Server')

#Reading SQL table and grouping by columns
data=pd.read_sql('select * from [dbo].[TempCompanyName]',engine)
#df1=pd.read_sql('Select * from company_Extension',engine)
#print(df1)
#gp = df.groupby(["CustomerName", "Quantity"]).size() 
#print(gp)

#1. Removing non-ASCII characters
data['Cleansed_Input'] = data['Original_Input'].apply(
    lambda x: ''.join(['' if ord(i) < 32 or ord(i) > 126 else i for i in x]))

#2. Removing punctuation
data['Cleansed_Input'] = data['Cleansed_Input'].apply(
    lambda x: x.translate(str.maketrans('', '', string.punctuation)))
#data['Cleansed_Input'] = data['Cleansed_Input'].apply(lambda x: ''.join([i for i in x if i not in string.punctuation]))

#3. Removing digits
data['Cleansed_Input'] = data['Cleansed_Input'].apply(
    lambda x: x.translate(str.maketrans('', '', string.digits)))
#data['Cleansed_Input'] = data['Cleansed_Input'].apply(lambda x: ''.join([i for i in x if i not in string.digits]))

#4. Removing trailing and leading spaces
data['Cleansed_Input'] = data['Cleansed_Input'].apply(lambda x: x.strip())

df=pd.DataFrame(data)
#data1=pd.DataFrame(df1)


df2 = pd.DataFrame({ 
"Name_Extension": ["llc",
                   "Pvt ltd",
                   "Corp",
                   "CO Ltd",
                   "inc", 
                   "CO",
                   "SA"],
"Company_Type": ["Company LLC",
                 "Private Limited",
                 "Corporation",
                 "Company Limited",
                 "Incorporated",
                 "Company",
                 "Anonymous Company"],
"Priority": [2, 8, 4, 3, 5, 1, 9]
})

data.to_sql('TempCompanyName', con=engine, if_exists='replace',index= False)

3 Answers


Here is a possible solution you can implement:

df = pd.DataFrame({
    "Original_Input": ["TECHNOLOGIES S.A", 
                       "A & J INDUSTRIES, LLC", 
                       "A&S DENTAL SERVICES", 
                       "A.M.G Médicale Inc", 
                       "AAREN SCIENTIFIC"],
    "Cleansed_Input": ["TECHNOLOGIES SA", 
                       "A J INDUSTRIES LLC", 
                       "AS DENTAL SERVICES", 
                       "AMG Mdicale Inc", 
                       "AAREN SCIENTIFIC"]
})

df_2 = pd.DataFrame({ 
    "Name_Extension": ["llc",
                       "Pvt ltd",
                       "Corp",
                       "CO Ltd",
                       "inc", 
                       "CO",
                       "SA"],
    "Company_Type": ["Company LLC",
                     "Private Limited",
                     "Corporation",
                     "Company Limited",
                     "Incorporated",
                     "Company",
                     "Anonymous Company"],
    "Priority": [2, 8, 4, 3, 5, 1, 9]
})

# Preprocessing text
df["lower_input"] = df["Cleansed_Input"].str.lower()
df_2["lower_extension"] = df_2["Name_Extension"].str.lower()

# Getting the lowest priority matching the end of the string
extensions_list = [ (priority, extension.lower_extension.values[0]) 
                    for priority, extension in df_2.groupby("Priority") ]
df["extension_priority"] = df["lower_input"] \
    .apply(lambda p: next(( priority 
                            for priority, extension in extensions_list 
                            if p.endswith(extension)), None))

# Merging both dataframes based on priority. This step can be ignored if you only need
# one column from the df_2. In that case, just give the column you require instead of 
# `priority` in the previous step.
df = df.merge(df_2, "left", left_on="extension_priority", right_on="Priority")

# Removing the matched extensions from the `Cleansed_Input` string
df["aux"] = df["lower_extension"].apply(lambda p: -len(p) if isinstance(p, str) else 0)
df["Core_Input"] = df.apply(
    lambda p: p["Cleansed_Input"] 
              if p["aux"] == 0 
              else p["Cleansed_Input"][:p["aux"]].strip(), 
    axis=1
)

# Selecting required columns
df[[ "Original_Input", "Core_Input", "Company_Type", "Name_Extension" ]]

I assumed that the "Priority" column would have unique values. However, if this isn't the case, just sort the priorities and create an index based on that order like this:

df_2.sort_values("Priority").assign(index = range(df_2.shape[0]))
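
If you need that, a possible (untested) adaptation is to keep the new index column and use it in place of Priority in the steps above, assuming "lower_extension" has already been added as shown earlier:

df_2 = df_2.sort_values("Priority").assign(index=range(df_2.shape[0]))

# build the (index, extension) pairs from the sorted frame instead of groupby("Priority")
extensions_list = [ (row["index"], row["lower_extension"])
                    for _, row in df_2.iterrows() ]

# ...and later merge with right_on="index" instead of right_on="Priority"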

Also, next time give the data example in a format that allows anyone to load it easily. It was cumbersome to handle the format you sent.

EDIT: Not related to the question, but it might be of some help. You can simplify steps 1 to 4 with the following:

data['Cleansed_Input'] = data["Original_Input"] \
    .str.replace("[^\w ]+", "") \ # removes non-alpha characters
    .str.replace(" +", " ") \ # removes duplicated spaces
    .str.strip() # removes spaces before or after the string
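
For instance, on one of the sample rows this chain should behave roughly as follows; note that, unlike steps 1 to 4 above, \w is Unicode-aware, so accented letters such as the é in "Médicale" are kept rather than dropped:

import re

s = "A & J INDUSTRIES, LLC"
s = re.sub(r"[^\w ]+", "", s)   # "A  J INDUSTRIES LLC"  -> punctuation removed
s = re.sub(r" +", " ", s)       # "A J INDUSTRIES LLC"   -> double space collapsed
s = s.strip()                    # no leading/trailing spaces left to remove here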

EDIT 2: SQL version of the solution (I'm using PostgreSQL, but I used standard SQL operators, so the differences shouldn't be that huge).

SELECT t.Original_Name,
       t.Cleansed_Input,
       t.Name_Extension,
       t.Company_Type,
       t.Priority
FROM (
    SELECT df.Original_Name,
           df.Cleansed_Input,
           df_2.Name_Extension,
           df_2.Company_Type,
           df_2.Priority,
           ROW_NUMBER() OVER (PARTITION BY df.Original_Name ORDER BY df_2.Priority) AS rn
    FROM (VALUES ('TECHNOLOGIES S.A', 'TECHNOLOGIES SA'), ('A & J INDUSTRIES, LLC', 'A J INDUSTRIES LLC'),
                 ('A&S DENTAL SERVICES', 'AS DENTAL SERVICES'), ('A.M.G Médicale Inc', 'AMG Mdicale Inc'),
                 ('AAREN SCIENTIFIC', 'AAREN SCIENTIFIC')) df(Original_Name, Cleansed_Input)
         LEFT JOIN (VALUES ('llc', 'Company LLC', '2'), ('Pvt ltd', 'Private Limited', '8'), ('Corp', 'Corporation', '4'),
                           ('CO Ltd', 'Company Limited', '3'), ('inc', 'Incorporated', '5'), ('CO', 'Company', '1'),
                           ('SA', 'Anonymous Company', '9')) df_2(Name_Extension, Company_Type, Priority)
            ON  lower(df.Cleansed_Input) like ( '%' || lower(df_2.Name_Extension) )
) t
WHERE rn = 1
MkWTF
  • Hey there, thanks for the answer. But I am still not getting the output needed. It just copied the Cleansed_Input to a new column named lower_input; it's not splitting up. –  Jan 10 '20 at 12:32
  • Did you preprocess the 'Cleansed_Input' to be lowercase? Could you share your code? – MkWTF Jan 10 '20 at 14:06
  • @DhanalakshmiV I don't see the code you used to find the respective extension for each row in your `df`. Could you add it too? Also, check my edit, it might be of some help. – MkWTF Jan 13 '20 at 11:07
  • Thank you so much, that worked for numbers, punctuation and spaces. And I didn't save any code that I did for splitting. What I did was I tried to write the code in SQL and executed it via Python, since I was unable to do it purely in Python. –  Jan 14 '20 at 04:58
  • engine.execute('''update A set A.Type_input = B.Company_Type from [dbo].[TempCompanyName] A (nolock), [dbo].[company_Extension]B where A.Cleansed_Input like '%'+B.Name_Extension''') engine.execute('''update A set A.Core_Input =replace(A.[Cleansed_Input],B.Name_Extension,'') from [TempCompanyName] A (nolock), [company_Extension]B where A.Cleansed_Input like '%'+B.Name_Extension''') engine.execution_options(autocommit=True) –  Jan 14 '20 at 05:14
  • This is the syntax I used to split and replace –  Jan 14 '20 at 05:14
  • @DhanalakshmiV In SQL it is easier, and if you are getting the data from the database already, you can simply do all the transformations there. [The regex part can be done there too.](https://stackoverflow.com/a/23001916/4110792) Check my edit for the rest. – MkWTF Jan 14 '20 at 10:17

IIUC, we can use some basic regex:

First we strip any leading and trailing white space from Name_Extension and split it on white space; this gives a column of lists, which we can flatten with chain.from_iterable.

Then we use some regex with the pandas methods str.findall and str.replace to match your inputs.

import re
from itertools import chain

ext = df2['Name_Extension'].str.strip().str.split(r'\s+')

ext = list(chain.from_iterable(i for i in ext))
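
# For reference, with the df2 from the question this flat token list should end up
# roughly as: ['co', 'llc', 'Pvt', 'ltd', 'Corp', 'CO', 'Ltd', 'inc', 'CO']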

df['Type_Input'] = df['Cleansed_Input'].str.findall('|'.join(ext),flags=re.IGNORECASE).str[0]

s = df['Cleansed_Input'].str.replace('|'.join(ext),'',regex=True,case=False).str.strip()

df.loc[df['Type_Input'].isnull()==False,'Core_Input'] = s

print(df)

          Original_Input      Cleansed_Input Type_Input      Core_Input
0       TECHNOLOGIES S.A     TECHNOLOGIES SA        NaN             NaN
1  A & J INDUSTRIES, LLC  A J INDUSTRIES LLC        LLC  A J INDUSTRIES
2    A&S DENTAL SERVICES  AS DENTAL SERVICES        NaN             NaN
3     A.M.G Médicale Inc     AMG Mdicale Inc        Inc     AMG Mdicale
4       AAREN SCIENTIFIC    AAREN SCIENTIFIC        NaN             NaN
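
If you also need the extension expanded into its Company_Type (the abbreviation part of the question), one possible untested follow-up, reusing the same df and df2, is to map each matched token back to its Company_Type:

# build a token -> Company_Type lookup from df2 (tokens shared by several
# extensions, such as 'co', keep the first Company_Type seen here)
token_to_type = {}
for exts, ctype in zip(df2['Name_Extension'].str.strip(), df2['Company_Type']):
    for tok in exts.split():
        token_to_type.setdefault(tok.lower(), ctype)

df['Type_Input'] = df['Type_Input'].str.lower().map(token_to_type)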
Umar.H
  • Thank you so much Datanovice, I am getting the split, but the abbreviation part is not done. I am trying to do that. –  Jan 10 '20 at 12:36
  • I just ran it over your raw data and it worked fine, what was the error you are getting? @DhanalakshmiV – Umar.H Jan 10 '20 at 12:43
  • No error, but it came out in different columns. Now I tried to drop and create the table again, run it, and put in the correct column names, but I am getting a programming error. Can you help me by keeping the same column names in your code instead of renaming them? –  Jan 10 '20 at 12:45
  • @DhanalakshmiV what do you mean by different columns? I renamed to match your output column names, I'm not sure what the issue is. – Umar.H Jan 10 '20 at 12:48
  • Original_Input Cleansed_Input Core_Input Type_input ext TECHNOLOGIES S.A TECHNOLOGIES SA SA company TECHNOLOGIES –  Jan 10 '20 at 12:51
  • Original_Input Cleansed_Input Core_Input Type_input ext company –  Jan 10 '20 at 12:52
  • The above are the column names I get, Datanovice. After the code I tried writing back to SQL, so in SQL I got these columns. –  Jan 10 '20 at 12:53
  • @DhanalakshmiV No idea, but that's beyond the scope of your question, so I suggest you close this one and open another :) – Umar.H Jan 10 '20 at 13:56

Assuming you have read in the dataframes as df1 and df2, the first step is to create two lists - one for Name_Extension (the keys) and one for Company_Type (the values), as shown below:

keys = list(df2['Name_Extension'])
keys = [key.strip().lower() for key in keys]
print (keys)
>>> ['co llc', 'pvt ltd', 'corp', 'co ltd', 'inc', 'co']
values = list(df2['Company_Type']) 
values = [value.strip().lower() for value in values]
print (values)
>>> ['company llc', 'private limited', 'corporation', 'company limited', 'incorporated', 'company']

The next step is to map each value in Cleansed_Input to a Core_Input and Type_input. We can use the pandas apply method on the Cleansed_Input column. To get the Core_Input:

def get_core_input(data):
    # preprocess
    data = str(data).strip().lower()
    # check if the data end with any of the keys
    for key in keys:
        if data.endswith(key):
            return data.split(key)[0].strip() # split the data and return the part without the key
    return None

df1['Core_Input'] = df1['Cleansed_Input'].apply(get_core_input)
print (df1)
>>>
          Original_Input      Cleansed_Input   Core_Input  Type_input
0       TECHNOLOGIES S.A     TECHNOLOGIES SA         None         NaN
1  A & J INDUSTRIES, LLC  A J INDUSTRIES LLC         None         NaN
2    A&S DENTAL SERVICES  AS DENTAL SERVICES         None         NaN
3     A.M.G Médicale Inc     AMG Mdicale Inc  amg mdicale         NaN
4       AAREN SCIENTIFIC   AAREN SCIENTIFIC          None         NaN

To get the Type_input:

def get_type_input(data):
    # preprocess
    data = str(data).strip().lower()
    # check if the data end with any of the keys
    for idx in range(len(keys)):
        if data.endswith(keys[idx]):
            return values[idx].strip() # return the value of the corresponding matched key
    return None

df1['Type_input'] = df1['Cleansed_Input'].apply(get_type_input)
print (df1)
>>>
          Original_Input      Cleansed_Input   Core_Input    Type_input
0       TECHNOLOGIES S.A     TECHNOLOGIES SA         None          None
1  A & J INDUSTRIES, LLC  A J INDUSTRIES LLC         None          None
2    A&S DENTAL SERVICES  AS DENTAL SERVICES         None          None
3     A.M.G Médicale Inc     AMG Mdicale Inc  amg mdicale  incorporated
4       AAREN SCIENTIFIC   AAREN SCIENTIFIC          None          None

This is a pretty easy-to-follow solution, but it is not the most efficient way to solve the problem, I am sure. Hopefully it solves your use case.
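
For example, a possible (untested) single-pass variant could reuse the keys and values lists built above, sort them so that longer extensions such as 'co ltd' are always tried before shorter ones such as 'co', and fill both columns with one apply:

# pair each key with its value and try the longest extensions first
pairs = sorted(zip(keys, values), key=lambda kv: len(kv[0]), reverse=True)

def split_company(data):
    text = str(data).strip().lower()
    for key, value in pairs:
        if text.endswith(key):
            # return core text and company type together
            return text[:-len(key)].strip(), value
    return None, None

matched = df1['Cleansed_Input'].apply(split_company)
df1['Core_Input'] = matched.str[0]
df1['Type_input'] = matched.str[1]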

Shaunak Sen