Here is a solution and some thoughts about this:
A working solution if, as in your example, the strings in Text are always separated by whitespace:
import pandas as pd
df = pd.DataFrame({'Text': ['A++ python', 'Teapot warmeR'],})
languages = ["Python", "R", "A++", "TEA"]
# Extracting column as list and convert to lower case
text_col = df['Text'].tolist()
text_col = [x.lower() for x in text_col]
# To lower case too
languages = [x.lower() for x in languages]
# Finding "whole words"
to_add = [lang for lang in languages for langs_list in text_col if lang in langs_list.split(" ")]
# Adding columns
for lang in to_add:
    df[lang] = pd.Series(dtype='int')
print(df)
Output:
            Text  python  a++
0     A++ python     NaN  NaN
1  Teapot warmeR     NaN  NaN
Thoughts:
In fact this is an interesting multi-causal problem.
1st cause: "A++" ends with 2 plus signs, which are regex special characters that need to be escaped.
2nd: You need to find whole words, so we should use regex word boundaries \b "as usual", but:
3rd: \b will match around "Python", but \b won't match between the last plus sign of "A++" (a non-word character) and the whitespace after it, because \b is a zero-width assertion that only matches between a word character (\w) and a non-word character (\W), or between a word character and the start or end of the string.
4th: We could replace the trailing \b with \B, and then the regex will match "A++" because \B is the negation of \b. But this time it will no longer match "Python", and it will match "TEA" (inside "Teapot")...
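To make the 3rd and 4th points concrete, here is a small sketch that lists the positions where \b and \B match in one of the sample strings. It only uses the standard re module; the variable names are mine, chosen for illustration.

```python
import re

text = "a++ python"

# Positions where \b matches: only where a word character meets a
# non-word character, or the start/end of the string.
boundaries = [m.start() for m in re.finditer(r"\b", text)]

# Positions where \B (the negation of \b) matches.
non_boundaries = [m.start() for m in re.finditer(r"\B", text)]

print(boundaries)
print(non_boundaries)
```

Index 3 (just after "++", before the space) is not among the \b positions, which is why a trailing \b fails for "A++", while it is among the \B positions.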
We can analyse this as follows. Here is the "final" (non-working) code, followed by an explanation of the steps taken to write it:
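import re

for lang in languages:
    if lang not in df.columns:
        # Escape regex special characters such as the "+" in "A++"
        needle = re.escape(lang)
        # Word boundary before the name, non-word boundary after it
        needle = r'\b{}\B'.format(needle)
        if df['Text'].str.contains(needle, case=False, regex=True).any():
            df[lang] = pd.Series(dtype='int')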
- For clarity, we use case=False and remove .str.lower() and lang.lower().
- We set regex=True in order to use regex to match whole words. But as is, the regex will not work as intended because the plus signs in "A++" are regex special characters and need to be escaped.
- We escape the strings with needle = re.escape(lang). But now we get substrings: Python, R, A++ and TEA.
- So we use the word boundary \b on both sides: needle = r'\b{}\b'.format(needle). But now we only get Python...
- So we use the non-word boundary \B at the end: needle = r'\b{}\B'.format(needle). Now we get A++, but this does not match Python anymore and we also get TEA...
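The steps above can be reproduced without pandas. This is a minimal sketch using plain lists and the standard re module; the helper matches() and the lower-cased sample data are only there for illustration.

```python
import re

texts = ["a++ python", "teapot warmer"]
languages = ["python", "r", "a++", "tea"]

def matches(template):
    """Return the languages whose escaped name, wrapped in template,
    is found in at least one of the texts."""
    found = []
    for lang in languages:
        pattern = template.format(re.escape(lang))
        if any(re.search(pattern, t) for t in texts):
            found.append(lang)
    return found

print(matches("{}"))        # escaped only: plain substring search
print(matches(r"\b{}\b"))   # word boundary on both sides
print(matches(r"\b{}\B"))   # non-word boundary at the end
```

The first call finds all four names, the second only python, and the third a++ and tea, so none of the three patterns gives the wanted result (python and a++).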
To conclude, we can't use a simple regex that works in all cases. But you can use a more complex regex (adaptive word boundaries from https://stackoverflow.com/a/45145800/3832970), as in the answer of @Wiktor Stribiżew.
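As a rough sketch of that idea, and not a copy of the linked answer: the lookaround patterns below are my assumption of what "adaptive word boundaries" look like. A word boundary is only required on a side where the name itself starts or ends with a word character. LEFT, RIGHT and whole_word_pattern() are illustrative names.

```python
import re

texts = ["A++ python", "Teapot warmeR"]
languages = ["Python", "R", "A++", "TEA"]

# Assumed form of the adaptive boundaries: require \b only on a side
# where the needle starts (LEFT) or ends (RIGHT) with a word character.
LEFT = r"(?:(?!\w)|\b(?=\w))"
RIGHT = r"(?:(?<=\w)\b|(?<!\w))"

def whole_word_pattern(needle):
    # Escape the needle so that "+" in "A++" is treated literally.
    return re.compile(LEFT + re.escape(needle) + RIGHT, re.IGNORECASE)

found = [lang for lang in languages
         if any(whole_word_pattern(lang).search(t) for t in texts)]
print(found)
```

With the sample data this keeps Python and A++ and rejects R and TEA. Note that a side ending in a non-word character is left unconstrained, so "A++" would also be found inside "A++x".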
And if, as in your example, the strings in Text are always separated by whitespace, we could split on whitespace and check whether the whole words are in the resulting lists using the in operator.