Create list based on column value and use that list to extract words from string column in df without overwriting row value with for loop

Question

Ok I admit it, I'm stuck. Hope someone can help me figure this out! I'll try to explain to the best of my abilities. I have two df's. One of them has the string column and municipalities and the other df has municipalities and streets. I want to create a street list per row (for that specific municipality) so it only extract streets in the string column for that specific municipality. The code I have now kinda works but it keeps iterating over all of the municipalities, therefore extracting streets in other municipalities and adding streets to the wrong rows. I hope the code examples below make my question a little more clear.

Create dataframes:

import pandas as pd
import re

# Sample dataframe with the municipality and string column
data1 = {'municipality': ['Urk','Utrecht','Almere','Utrecht','Huizen'],
        'text': ["I'm going to Plantage, Pollux and Oostvaardersdiep","Tomorrow I'm going to Hoog Catharijne", 
                 "I'm not going to the Balijelaan","I'm not going to Socrateshof today",
                 "Next week I'll be going to Socrateshof"]}

df = pd.DataFrame(data1, columns = ['municipality','text'])
print(df)

Output:

  municipality                                               text
0          Urk  I'm going to Plantage, Pollux and Oostvaarders...
1      Utrecht              Tomorrow I'm going to Hoog Catharijne
2       Almere                    I'm not going to the Balijelaan
3      Utrecht                 I'm not going to Socrateshof today
4       Huizen             Next week I'll be going to Socrateshof

# Sample dataframe with the municipality and street 
data2 = {'municipality': ['Urk','Urk','Utrecht','Almere','Almere','Huizen'],
        'street_name': ['Plantage','Pollux','Balijelaan','Oostvaardersdiep','Catharijne','Socrateshof']}
df2 = pd.DataFrame(data2, columns = ['municipality','street_name'])
print(df2)

Output:

  municipality       street_name
0          Urk          Plantage
1          Urk            Pollux
2      Utrecht        Balijelaan
3       Almere  Oostvaardersdiep
4       Almere        Catharijne
5       Huizen       Socrateshof

Run the function below:

# Function
street = []
def extract_street(txt):
    mun_list_filter = df['municipality'] # I want the streets for this municipality
    df_bag_filter_mun = df2[df2['municipality'].isin(mun_list_filter)] # Filter second df on the wanted municipality
    street_list_mun = list(df_bag_filter_mun['street_name'].unique()) # Select all unique streets for the specific municipality
    st = re.findall(r"\b|".join(street_list_mun), txt) # Find all the streets in the string column 'tekst'
    street.append(st) # Append to empty street list
    return street # As you can see it keeps iterating over all municipalities 

# Call function by iterating over rows in string column
for txt in df['text']:
    extract_street(txt)

# Add street list to df
df = df.assign(**{'street_match': street})
df['street_match'] = [', '.join(map(str, l)) for l in df['street_match']]
df

Output:

    municipality text                                                street_match
0   Urk          I'm going to Plantage, Pollux and Oostvaardersdiep  Plantage, Pollux, Oostvaardersdiep
1   Utrecht      Tomorrow I'm going to Hoog Catharijne               Catharijne
2   Almere       I'm not going to the Balijelaan                     Balijelaan
3   Utrecht      I'm not going to Socrateshof today                  Socrateshof
4   Huizen       Next week I'll be going to Socrateshof              Socrateshof

As you can see in the first row for municipality 'Urk' the function added the street 'Oostvaardersdiep' even though this should've only been matched if the municipality for the first row is 'Almere'. Only the last row is correct since 'Socrateshof' is in fact in the municipality 'Huizen'.

Desired result:

    municipality text                                                street_match
0   Urk          I'm going to Plantage, Pollux and Oostvaardersdiep  Plantage, Pollux
1   Utrecht      Tomorrow I'm going to Hoog Catharijne              
2   Almere       I'm not going to the Balijelaan                    
3   Utrecht      I'm not going to Socrateshof today                 
4   Huizen       Next week I'll be going to Socrateshof              Socrateshof

I know what the problem is I just don't know how to fix it. I've tried with apply/lambda but no luck either. Thanks!

Shouldn't you be whether declaring `street` local to `extract_street` or emptying it with every call? Otherwise `street` will be accumulating results from previous calls. — rturrado, Feb 09 '21 at 12:14
@rturrado That was my initial thought too but `street` is effectively a _list of lists_ - one list of street names for each row of the overall result. — aneroid, Feb 09 '21 at 12:47

score 2 · Accepted Answer · answered Feb 09 '21 at 13:42

Adding another answer to show a shorter/simpler way to do what you wanted. (The first one was just to fix what was not working in your code.)

Using .apply(), you can call a modified verison of your function per row of df and then do the checking with the street names in df2.

def extract_street(row):
    street_list_mun = df2.loc[df2['municipality'] == row['municipality'], 'street_name'].unique()
    streets_regex = r'\b(' + '|'.join(street_list_mun) + r')\b'
    streets_found = set(re.findall(streets_regex, row['text']))
    return ', '.join(streets_found)
    ## or if you want this to return a list of streets
    # return list(streets_found)

df['street_match'] = df.apply(extract_street, axis=1)
df

Output:

  municipality                                                text      street_match
0          Urk  I'm going to Plantage, Pollux and Oostvaardersdiep  Plantage, Pollux
1      Utrecht               Tomorrow I'm going to Hoog Catharijne                  
2       Almere                     I'm not going to the Balijelaan                  
3      Utrecht                  I'm not going to Socrateshof today                  
4       Huizen              Next week I'll be going to Socrateshof       Socrateshof

Note:

There's an issue with your regex - the join part of the expression generates strings like Plantage\b|Pollux. Which will give a match if (a) the last street name is at the beginning of another word or (b) if the any-except-the-last street names is at the end of another word: "I'm going to NotPlantage, Polluxsss and Oostvaardersdiep" will match for both streets, but it shouldn't. Instead, the word boundary \b should be at ends of the list of options and with parentheses to separate them. It should generate strings like: \b(Plantage|Pollux)\b. This won't match with "Polluxsss" or "NotPlantage". I've made that change in the code above.
I'm using set to get a unique list of street matches. If the line was "I'm going to Pollux, Pollux, Pollux" it would haven given the result 3 times instead of just once.

First of all thanks so much! It's exactly what I need. Took a while to get back to you cause I first wanted to figure out your code and try some additional but similar techniques based on your code. I got stuck again haha, i know it's a lot to ask but if you do find the time I would (again) appreciate it! See my own "answer" below. — avgjoe13, Feb 15 '21 at 17:12

aneroid · Answer 2 · 2021-02-15T18:18:33.937

One problem with passing in only the text is that you can't do the municipality filter. Which is why you're getting the street 'Oostvaardersdiep' for 'Urk', even though it's in 'Almere'. You get it because the name 'Oostvaardersdiep' appears in the text for the 'Urk' entry. Your extract_streets() function doesn't know which municipality to be matching with.

The smallest change to get your code to work is this:

Pass in mun along with txt to extract_street()
mun_list_filter should use the mun instead of all the municipalities

street = []
def extract_street(txt, mun):  # Pass in municipality
    df_bag_filter_mun = df2[df2['municipality'] == mun]
    ### everything below is COPY-PASTED from your question
    street_list_mun = list(df_bag_filter_mun['street_name'].unique()) # Select all unique streets for the specific municipality
    st = re.findall(r"\b|".join(street_list_mun), txt) # Find all the streets in the string column 'tekst'
    street.append(st) # Append to empty street list
    return street # As you can see it keeps iterating over all municipalities 

# add the 'municipality' for the extract loop
for txt, mun in zip(df['text'], df['municipality']):  
    extract_street(txt, mun)

# Add street list to df
df = df.assign(**{'street_match': street})

Output:

  municipality                                                text        street_match
0          Urk  I'm going to Plantage, Pollux and Oostvaardersdiep  [Plantage, Pollux]
1      Utrecht               Tomorrow I'm going to Hoog Catharijne                  []
2       Almere                     I'm not going to the Balijelaan                  []
3      Utrecht                  I'm not going to Socrateshof today                  []
4       Huizen              Next week I'll be going to Socrateshof       [Socrateshof]

And then join the list to make it a string:

df['street_match'] = df['street_match'].str.join(', ')

Output:

  municipality                                                text      street_match
0          Urk  I'm going to Plantage, Pollux and Oostvaardersdiep  Plantage, Pollux
1      Utrecht               Tomorrow I'm going to Hoog Catharijne                  
2       Almere                     I'm not going to the Balijelaan                  
3      Utrecht                  I'm not going to Socrateshof today                  
4       Huizen              Next week I'll be going to Socrateshof       Socrateshof

FYI, there's a small error in the regex from the original code, fixed in [my other answer](https://stackoverflow.com/a/66120173/1431750). And fixed an issue of one street name appearing multiple times in 'text' giving the same street match multiple times instead of just once. — aneroid, Feb 09 '21 at 13:45

avgjoe13 · Answer 3 · 2021-02-15T17:31:34.820

@aneroid I now want to extract multiple exact matches (which are in a list) from a similar text column. The code below (based on your regex) works for this simple example but on my larger more complex dataset I get a bunch of tuples and empty strings.. Do you know how I could improve this code?

# String column
data1 = {'text': ["Today I'm going to Utrecht","Tomorrow I'm going to Utrecht and Urk", 
                 "Next week I'll be going to the Amsterdamsestraatweg"]}

df = pd.DataFrame(data1, columns = ['text'])
print(df)

# City column in other df
data2 = {'city': ['Urk','Utrecht','Almere','Huizen','Amsterdam','Urk']}
df2 = pd.DataFrame(data2, columns = ['city'])
print(df2)

# I create a list of all the unique cities in df2
city_list = list(df2['city'].unique())
len(city_list)
len(set(city_list))

# Extract the words if there is an exact match 
df['city_match'] = df['text'].str.findall(r'\b(' + '|'.join(city_list) + r')\b')
df['city_match'] = [', '.join(map(str, l)) for l in df['city_match']]
print(df)

# Output
                                                text    city_match
0                         Today I'm going to Utrecht       Utrecht
1              Tomorrow I'm going to Utrecht and Urk  Utrecht, Urk
2  Next week I'll be going to the Amsterdamsestra...

As you can see it works. The 'Amsterdamsestraatweg' is not an exact match so it didn't match. Strangely in my larger df I get a bunch of tuples and empty strings as output like so:

0                        ('Wijk bij Duurstede', '', '')
6                                   ('Utrecht', '', '')
7     ('Huizen', '', ''), ('Huizen', '', ''), ('Huiz...
9     ('Utrecht', '', ''), ('Utrecht', '', ''), ('Ut...
10                     ('Urk', '', ''), ('Urk', '', '')
11    ('Amersfoort', '', ''), ('Amersfoort', '', '')...
12                                 ('Lelystad', '', '')
13             ('Utrecht', '', ''), ('Utrecht', '', '')
16    ('Hilversum', '', ''), ('Hilversum', '', ''), ...
18             ('De Bilt', '', ''), ('De Bilt', '', '')
19                                      ('Urk', '', '')

Thanks again

**1.** For the lines `city_list = list(df2['city'].unique())` and `df['city_match'] = df['text'].str.findall(r'\b(' + '|'.join(city_list) + r')\b')`, `city_list` doesn't need to be a `list`. Doing just `city_list = df2['city'].unique()` is enough for the regex to work properly. **2.** To join the list items into a single string, you can do `df['city_match'] = df['city_match'].str.join(', ')`. (I've added that to the first answer, for clarity.) **3.** Could you provide `data1` in a form that includes the strange results above? Running it on the data you provided yields normal results. — aneroid, Feb 15 '21 at 18:19
**4.** The part `map(str, l)) for l in df['city_match']` converts what's a _list_ in the dataframe cell into a string and then attempts to join it back (or something). Use the correct `join` as shown above or in the edited first answer and see if the problem occurs at all. May not need a larger dataset. P.S. If the problem persists, perhaps better to ask a new question :-) — aneroid, Feb 15 '21 at 18:25
**5.** Using the `re.findall()` instead of `df['city'].str.findall(...)` would allow you to use `set` on the result so that you only get unique cities. Or, you can leave the result to contain duplicate items in the list but then use `apply` on it in the next step and use [a function like in this answer](https://stackoverflow.com/a/7961390/1431750) to ensure only unique values. Do the convert-to-unique _before_ you convert to a string. — aneroid, Feb 15 '21 at 18:35

Create list based on column value and use that list to extract words from string column in df without overwriting row value with for loop

3 Answers3