How can I remove strings from sentences if they match strings in a list?

Question

I have a pandas.Series with sentences like this:

0    mi sobrino carlos bajó conmigo el lunes       
1    juan antonio es un tio guay                   
2    voy al cine con ramón                         
3    pepe el panadero siempre se porta bien conmigo
4    martha me hace feliz todos los días

On the other hand, I have a list of names and surnames like this:

l = ['juan', 'antonio', 'esther', 'josefa', 'mariano', 'cristina', 'carlos']

I want to match sentences from the series against the names in the list. The real data is much bigger than this example, so I assumed an element-wise comparison between the series and the list would not be efficient. Instead, I joined all the names in the list into one big string:

'|'.join(l)

I then tried to build a boolean mask that would let me index the sentences containing any of the names:

series.apply(lambda x: x in '|'.join(l))

but it returns:

0    False
1    False
2    False
3    False
4    False

which is clearly not right.

I also tried str.contains(), but it doesn't behave as I expect: it matches any name that appears as a substring of a sentence, and that is not what I need (i.e. I need an exact match).

Could you please point me in the right direction here?

Thank you very much in advance

I don't know what pandas is, but you could use regex, even though it might be heavier – Aurele Collinet, Jul 22 '20 at 10:48
Surely yes, but I haven't mastered regex. If you provide a working regex, I could give it a try :) – Miguel 2488, Jul 22 '20 at 10:50
@everyone Thanks to all of you guys!! I'm overwhelmed by all the support you provided. Thanks for your collaboration :D – Miguel 2488, Jul 22 '20 at 12:08

jezrael · Accepted Answer · 2020-07-22T11:05:06.937

3

If you need an exact match, you can use word boundaries:

series.str.contains('|'.join(rf"\b{x}\b" for x in l))
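The attempt in the question, `x in '|'.join(l)`, asks whether the whole sentence is a substring of the joined names, which is why every row came back False. A minimal sketch of the difference, using only the standard `re` module on single sentences. The names `sentence`, `names`, and `pattern` are illustrative, and `re.escape` is an added precaution (an assumption, not part of the answer) in case a name contains regex metacharacters:

```python
import re

names = ['juan', 'antonio', 'esther', 'josefa', 'mariano', 'cristina', 'carlos']
sentence = 'juan antonio es un tio guay'

# The question's test: is the whole sentence inside 'juan|antonio|...'? It is not.
in_joined = sentence in '|'.join(names)

# Whole-word alternation: \bjuan\b|\bantonio\b|...
pattern = '|'.join(rf'\b{re.escape(name)}\b' for name in names)
whole_word = bool(re.search(pattern, sentence))

# 'juanita' contains 'juan' as a substring but not as a whole word.
substring_hit = bool(re.search('|'.join(names), 'juanita llegó tarde'))
whole_word_hit = bool(re.search(pattern, 'juanita llegó tarde'))

print(in_joined, whole_word, substring_hit, whole_word_hit)
```

The same `pattern` string can be passed to `series.str.contains(pattern)` to build the mask.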

To remove the values in the list, split each sentence on whitespace and use a generator expression to keep only the words that are not in the list:

series = series.apply(lambda x: ' '.join(y for y in x.split() if y not in l))
print (series)
                            
0                mi sobrino bajó conmigo el lunes
1                                  es un tio guay
2                           voy al cine con ramón
3  pepe el panadero siempre se porta bien conmigo
4             martha me hace feliz todos los días

edited Jul 22 '20 at 11:05

answered Jul 22 '20 at 10:54


What does `rf` mean here? – bigbounty Jul 22 '20 at 10:57
@bigbounty - raw string (as commonly used for regex) and f-string ;) – jezrael Jul 22 '20 at 10:57
Oh nice, I didn't know we could combine both. +1 – bigbounty Jul 22 '20 at 10:58
Thank you again Jezrael. Could you just provide a bit more of information about what your code does please? – Miguel 2488 Jul 22 '20 at 12:10
@Miguel2488 - Sure, give me some time. – jezrael Jul 22 '20 at 12:22
@Miguel2488 - First, for what a word boundary is, check [this](https://stackoverflow.com/questions/1324676/what-is-a-word-boundary-in-regex). – jezrael Jul 22 '20 at 12:26
@Miguel2488 - Second, `.split()` splits each value on whitespace (the default separator). The comprehension then iterates over the split words and keeps only those that are not in the list (tested with `not in`), which removes the matched names. Finally, `join` puts the remaining words back together with spaces. – jezrael Jul 22 '20 at 12:31
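The split, filter, and join steps described in the comments can be sketched in plain Python as below. The helper name `remove_names` is hypothetical, and turning the list into a set is an added assumption that keeps each `not in` check cheap when the name list is large:

```python
# A set gives average O(1) membership tests, unlike a list.
names = {'juan', 'antonio', 'esther', 'josefa', 'mariano', 'cristina', 'carlos'}

def remove_names(sentence, names=names):
    words = sentence.split()                      # 1. split on whitespace
    kept = [w for w in words if w not in names]   # 2. drop words found in the set
    return ' '.join(kept)                         # 3. rebuild the sentence

result = remove_names('mi sobrino carlos bajó conmigo el lunes')
print(result)  # mi sobrino bajó conmigo el lunes
```

On a Series this would be applied as `series.apply(remove_names)`.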

Aurele Collinet · Answer 2 · 2020-07-22T11:04:03.207

1

import re

data = ["mi sobrino carlos bajó conmigo el lunes", "juan antonio es un tio guay", "martha me hace feliz todos los días"]

regexs = ['juan', 'antonio', 'esther', 'josefa', 'mariano', 'cristina', 'carlos']

for regex in regexs:
    for sentence in data:
        # re.search scans the whole sentence; re.match only tests its start.
        # \b restricts the match to whole words.
        if re.search(rf"\b{regex}\b", sentence):
            print(True)
        else:
            print(False)

I guess something simple like that could work

cf : https://docs.python.org/fr/3/library/re.html
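One detail of the `re` module matters for this approach: `re.match` only tests the start of the string, while `re.search` scans all of it. A small sketch of the difference on one of the sample sentences (the variable names are illustrative):

```python
import re

sentence = 'mi sobrino carlos bajó conmigo el lunes'

at_start = re.match('carlos', sentence)     # None: the sentence does not start with 'carlos'
anywhere = re.search('carlos', sentence)    # a match object: 'carlos' occurs later on
starts_with_mi = re.match('mi', sentence)   # a match object: the sentence starts with 'mi'

print(at_start is None, anywhere is not None, starts_with_mi is not None)
```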

edited Jul 22 '20 at 11:04

answered Jul 22 '20 at 10:55


And be careful with the encoding; for Spanish it is UTF-8, I think – Aurele Collinet Jul 22 '20 at 10:56
for a full match just go for "^.*juan.*$" – Aurele Collinet Jul 22 '20 at 10:59

score 1 · Answer 3 · answered Jul 22 '20 at 10:59

Use a regex to check whether a name appears at the start, at the end, or in between:

import re

import pandas as pd

df = pd.DataFrame({'texts': [
                             'mi sobrino carlos bajó conmigo el lunes',
                             'juan antonio es un tio guay',
                             'voy al cine con ramón',
                             'pepe el panadero siempre se porta bien conmigo',
                             'martha me hace feliz todos los días '
                             ]})

names = ['juan', 'antonio', 'esther', 'josefa', 'mariano', 'cristina', 'carlos']

# \b also matches at the start and end of a string, so one term per name
# covers all three positions without also accepting prefixes such as "juanita".
pattern = "|".join([rf"\b{s}\b" for s in names])

df[df.apply(lambda x: 
            x.astype(str).str.contains(pattern, flags=re.I)).any(axis=1)]
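A `\b` boundary also matches at the very start and end of a string, so a single `\b{s}\b` term per name covers all three positions. A bare `^juan` anchor, by contrast, also accepts `juanita`. A minimal sketch with the standard `re` module; the sample strings and the `samples` and `results` names are made up for illustration:

```python
import re

names = ['juan', 'carlos']
pattern = re.compile('|'.join(rf'\b{re.escape(s)}\b' for s in names),
                     flags=re.IGNORECASE)

samples = {
    'start': 'Juan llegó tarde',
    'middle': 'mi sobrino carlos bajó',
    'end': 'voy al cine con CARLOS',
    'prefix only': 'juanita llegó tarde',
}
results = {label: bool(pattern.search(text)) for label, text in samples.items()}
print(results)

# An anchored pattern without a trailing boundary matches the longer word too.
loose = bool(re.search('^juan', 'juanita llegó tarde'))
print(loose)
```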

score 1 · Answer 4 · answered Jul 22 '20 at 11:02

One option is set intersection:

l = {'juan', 'antonio', 'esther', 'josefa', 'mariano', 'cristina', 'carlos'}
# True when a sentence shares at least one whole word with the name set
series.apply(lambda x: len(set(x.split()).intersection(l)) > 0)
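A plain-Python sketch of the same idea, using `isdisjoint` instead of building the full intersection. The helper name `mentions_name` is hypothetical. Note the caveat in the last line: because the sentence is split on whitespace only, a name with punctuation attached is not detected.

```python
names = {'juan', 'antonio', 'esther', 'josefa', 'mariano', 'cristina', 'carlos'}

sentences = [
    'mi sobrino carlos bajó conmigo el lunes',
    'juan antonio es un tio guay',
    'voy al cine con ramón',
    'pepe el panadero siempre se porta bien conmigo',
    'martha me hace feliz todos los días',
]

def mentions_name(sentence, names=names):
    # True when at least one whitespace-separated word is in the name set.
    return not names.isdisjoint(sentence.split())

flags = [mentions_name(s) for s in sentences]
print(flags)

# Caveat: 'carlos.' (with the trailing period) is not equal to 'carlos'.
with_punctuation = mentions_name('hola carlos.')
print(with_punctuation)
```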

answered Jul 22 '20 at 11:02

Ezer K


wwnde · Answer 5 · 2020-07-22T11:17:02.430

For an exact match, wrap each name in word boundaries and try:

df.text.str.contains("|".join(rf"\b{x}\b" for x in l))

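Otherwise, simply use regular expressions to replace each matching substring with ''. The patterns are passed as a list built with a list comprehension. Note that the surrounding spaces are kept, so row 0 ends up with a double space: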

df.replace(regex=[x for x in l], value='')
                          

                                   text
0               mi sobrino  bajó conmigo el lunes
1                                  es un tio guay
2                           voy al cine con ramón
3  pepe el panadero siempre se porta bien conmigo
4             martha me hace feliz todos los días
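A hedged variant of the replacement approach that matches whole words only and then tidies the leftover whitespace. It assumes a DataFrame with a `text` column as in this answer; the `pattern` and `cleaned` names, the `re.escape` call, and the whitespace cleanup are additions for illustration:

```python
import re

import pandas as pd

l = ['juan', 'antonio', 'esther', 'josefa', 'mariano', 'cristina', 'carlos']
df = pd.DataFrame({'text': [
    'mi sobrino carlos bajó conmigo el lunes',
    'juan antonio es un tio guay',
    'voy al cine con ramón',
    'pepe el panadero siempre se porta bien conmigo',
    'martha me hace feliz todos los días',
]})

# One alternation of whole-word patterns: \bjuan\b|\bantonio\b|...
pattern = '|'.join(rf'\b{re.escape(x)}\b' for x in l)

cleaned = (
    df['text']
    .str.replace(pattern, '', regex=True)   # remove the names
    .str.replace(r'\s+', ' ', regex=True)   # collapse repeated whitespace
    .str.strip()                            # trim leading/trailing spaces
)
print(cleaned.tolist())
```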

score 1 · Answer 6 · answered Jul 22 '20 at 11:39

If you want a little more flexibility for processing, you can have your custom exact_match function as below:

import re 

def exact_match(text, l=l):
    return bool(re.search('|'.join(rf'\b{x}\b' for x in l), text))

series.apply(exact_match)

Output:

0     True
1     True
2    False
3    False
4    False
dtype: bool
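Since the name list does not change between calls, one possible refinement is to build and compile the pattern once rather than joining the names on every call. A sketch under that assumption; `NAME_PATTERN` and the `pattern` parameter are hypothetical names, and `re.escape` is an added precaution:

```python
import re

l = ['juan', 'antonio', 'esther', 'josefa', 'mariano', 'cristina', 'carlos']

# Built once at import time instead of inside the function.
NAME_PATTERN = re.compile('|'.join(rf'\b{re.escape(x)}\b' for x in l))

def exact_match(text, pattern=NAME_PATTERN):
    # True when any listed name occurs in the text as a whole word.
    return bool(pattern.search(text))

sentences = [
    'mi sobrino carlos bajó conmigo el lunes',
    'juan antonio es un tio guay',
    'voy al cine con ramón',
    'pepe el panadero siempre se porta bien conmigo',
    'martha me hace feliz todos los días',
]
matches = [exact_match(s) for s in sentences]
print(matches)
```

With a pandas Series the call stays the same: `series.apply(exact_match)`.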
