3

I have 2 datasets, which contain unique values. The first one, which is the one below has the full name of a few hospitals.

+-----------------------+
| HOSPITAL_NAME_FULL    |
+-----------------------+
| St. Christine         |
| Californian Hospital  |
| Holy Mercy Hospital   |
| Germanic NW Hospital  |
| Trauma Center Hospital|
| Holy Spirit Hospital  |
| Mater Hospital        |
+-----------------------+

The other one has the short name of the same hospitals above.

+---------------------+
| HOSPITAL_NAME_SHORT |
+---------------------+
| Christine           |
| Californian         |
| Mercy               |
| Germanic            |
| Trauma              |
| Holy                |
| Mater               |
+---------------------+

The thing is, I need to join them, so I can have both full name and short name. Can I join dataframes while using some kind of regex so I can have this result?

+-----------------------+---------------------+ 
| HOSPITAL_NAME_FULL    | HOSPITAL_NAME_SHORT |
+-----------------------+---------------------+
| St. Christine         | Christine           |
| Californian Hospital  | Californian         |
| Holy Mercy Hospital   | Mercy               |
| Germanic NW Hospital  | Germanic            |
| Trauma Center Hospital| Trauma              |
| Holy Spirit Hospital  | Holy                |
| Mater Hospital        | Mater               |
+-----------------------+---------------------+

Thanks!

I'mahdi
  • 23,382
  • 5
  • 22
  • 30
Marcos Dias
  • 420
  • 3
  • 9

3 Answers3

3

TL;DR, If your words in HOSPITAL_NAME_SHORT exactly exist in HOSPITAL_NAME_FULL I recommend the first approach and if you want to consider similarity between words, I recommend second approach:

First Approach: (word exactly exist)

df1['HOSPITAL_NAME_SHORT'] = df1['HOSPITAL_NAME_FULL'].apply(
    lambda x: next(st for st in df2['HOSPITAL_NAME_SHORT'] if st in x))

print(df1)

Second Approach (consider Similarty)

You can use difflib.SequenceMatcher and find similarities between each two words from two dataframes and return a word that have max similarity for each row:

What is difflib.SequenceMatcher:

>>> SequenceMatcher(None, 'Vitamin_A', 'Vitamin_C').ratio()
0.8888888888888888

Solving this problem with SequenceMatcher:

from difflib import SequenceMatcher
df1['HOSPITAL_NAME_SHORT'] = df1['HOSPITAL_NAME_FULL'].apply(
    lambda x: max([(st, SequenceMatcher(None, x, st).ratio()) 
                   for st in df2['HOSPITAL_NAME_SHORT']], key=lambda x: x[1])[0])

print(df1)

Output:

       HOSPITAL_NAME_FULL HOSPITAL_NAME_SHORT
0           St. Christine           Christine
1    Californian Hospital         Californian
2     Holy Mercy Hospital               Mercy
3    Germanic NW Hospital            Germanic
4  Trauma Center Hospital              Trauma
5    Holy Spirit Hospital                Holy
6          Mater Hospital               Mater

Input two dataframe:

print(df1)
#        HOSPITAL_NAME_FULL
# 0           St. Christine
# 1    Californian Hospital
# 2     Holy Mercy Hospital
# 3    Germanic NW Hospital
# 4  Trauma Center Hospital
# 5    Holy Spirit Hospital
# 6          Mater Hospital

print(df2)
#   HOSPITAL_NAME_SHORT
# 0           Christine
# 1         Californian
# 2               Mercy
# 3            Germanic
# 4              Trauma
# 5                Holy
# 6               Mater
I'mahdi
  • 23,382
  • 5
  • 22
  • 30
  • 1
    Note the warning in the [`Series.values`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.values.html) docs recommending use of `to_numpy()` instead. – constantstranger Jun 26 '22 at 14:02
  • 1
    In fact, in this case I think you can use the Series directly in your list comprehension with no method call (see my answer, for example). – constantstranger Jun 26 '22 at 14:06
  • @constantstranger, you are right, I'm too fast for sending and don't consider – I'mahdi Jun 26 '22 at 14:07
2

You can try .str.extract

df1['HOSPITAL_NAME_SHORT'] = df1['HOSPITAL_NAME_FULL'].str.extract('('+'|'.join(df2['HOSPITAL_NAME_SHORT'])+')')
print(df1)

       HOSPITAL_NAME_FULL HOSPITAL_NAME_SHORT
0           St. Christine           Christine
1    Californian Hospital         Californian
2     Holy Mercy Hospital                Holy
3    Germanic NW Hospital            Germanic
4  Trauma Center Hospital              Trauma
5    Holy Spirit Hospital                Holy
6          Mater Hospital               Mater

You can also consider str.extractall to show to all possible matches

df1 = (df1
       .join(
           df1['HOSPITAL_NAME_FULL'].str.extractall('('+'|'.join(df2['HOSPITAL_NAME_SHORT'])+')')
           .unstack().droplevel(0, axis=1)
       )
)
print(df1)

       HOSPITAL_NAME_FULL            0      1
0           St. Christine    Christine    NaN
1    Californian Hospital  Californian    NaN
2     Holy Mercy Hospital         Holy  Mercy
3    Germanic NW Hospital     Germanic    NaN
4  Trauma Center Hospital       Trauma    NaN
5    Holy Spirit Hospital         Holy    NaN
6          Mater Hospital        Mater    NaN
Ynjxsjmh
  • 28,441
  • 6
  • 34
  • 52
1

Here's a way to do it:

df1['HOSPITAL_NAME_SHORT'] = df1.HOSPITAL_NAME_FULL.apply(lambda x: [y for y in df2.HOSPITAL_NAME_SHORT if y in x][0])

Output:

       HOSPITAL_NAME_FULL HOSPITAL_NAME_SHORT
0           St. Christine           Christine
1    Californian Hospital         Californian
2     Holy Mercy Hospital               Mercy
3    Germanic NW Hospital            Germanic
4  Trauma Center Hospital              Trauma
5    Holy Spirit Hospital                Holy
6          Mater Hospital               Mater
constantstranger
  • 9,176
  • 2
  • 5
  • 19