Assuming you have a column Text in dataframe df, you can try:
df2 = df['Text'].str.split().explode()
m = df2.str.contains(r'[A-Za-z]') & df2.str.contains(r'\d')
df_out = df2[~m].groupby(level=0).agg(' '.join)
df_out = df_out.to_frame(name='Text')
Explanation
We split the text into separate words, then explode the list of words into multiple rows with one word per row. Then we test whether each word contains both alphabetic characters and digits, using regex with .str.contains() as follows:
.str.contains(r'[A-Za-z]') # test any character in [A-Za-z] in string
and
.str.contains(r'\d') # test any numeric digit in string
Then, with the boolean mask m that combines the alpha and digit tests, we select only those entries that do not contain both letters and digits:
df2[~m]
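To make the mask concrete, here is a minimal sketch on a few hypothetical words (the sample values and the names words, has_alpha and has_digit are made up for illustration). It shows that only words containing both a letter and a digit are flagged, so a purely numeric word is kept:

```python
import pandas as pd

# Hypothetical words, as they would look after split() and explode()
words = pd.Series(['abc123', 'CAT', '2021', 'LABS,'])

has_alpha = words.str.contains(r'[A-Za-z]')  # True where the word has a letter
has_digit = words.str.contains(r'\d')        # True where the word has a digit

# A word is flagged only when both tests are True
m = has_alpha & has_digit
print(m.tolist())          # [True, False, False, False]

# ~m inverts the mask, keeping every word that is not flagged
print(words[~m].tolist())  # ['CAT', '2021', 'LABS,']
```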
Then we assemble the remaining words (with the alphanumeric words removed) back into a sentence using:
groupby(level=0).agg(' '.join)
Here we group by level=0, which is the original row index before explode (i.e. the original row number).
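The role of the index can be seen in a small sketch, using hypothetical sample data in which the second row consists only of mixed letter/digit words. It also illustrates one consequence worth checking on your own data: a row that loses all of its words has no group left, so it is absent from the result unless you restore it, for example with reindex:

```python
import pandas as pd

# Hypothetical sample data; row 1 contains only mixed letter/digit words
s = pd.Series(['abc123 CAT LABS', 'x42 y7'])

exploded = s.str.split().explode()
# Every word carries the index label of the row it came from
print(exploded.index.tolist())  # [0, 0, 0, 1, 1]

m = exploded.str.contains(r'[A-Za-z]') & exploded.str.contains(r'\d')

# Grouping on index level 0 rebuilds one string per original row
kept = exploded[~m].groupby(level=0).agg(' '.join)
print(kept.to_dict())  # {0: 'CAT LABS'}

# Row 1 lost all its words, so it has no group; reindex restores it as ''
restored = kept.reindex(s.index, fill_value='')
print(restored.tolist())  # ['CAT LABS', '']
```

Whether an empty string is the right fill value depends on what you do with the column afterwards; it is only one possible choice.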
Demo
import pandas as pd

data = {'Text': ['2fvRE-Ku89lkRVJ44QQFN ABACUS LABS, INC', 'abc123 CAT LABS, INC']}
df = pd.DataFrame(data)
Text
0 2fvRE-Ku89lkRVJ44QQFN ABACUS LABS, INC
1 abc123 CAT LABS, INC
df2 = df['Text'].str.split().explode()
m = df2.str.contains(r'[A-Za-z]') & df2.str.contains(r'\d')
df_out = df2[~m].groupby(level=0).agg(' '.join)
df_out = df_out.to_frame(name='Text')
Text
0 ABACUS LABS, INC
1 CAT LABS, INC
Edit
We can also simplify it as:
df2 = df['Text'].str.findall(r'\b(?!.*[A-Za-z]+.*\d+)(?!.*\d+.*[A-Za-z]+.*).+\b').str.join(' ').str.strip()
Explanation
The regex still follows the requirement of excluding alphanumeric words. Regex:
r'\b(?!.*[A-Za-z]+.*\d+)(?!.*\d+.*[A-Za-z]+.*).+\b'
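Within the word boundaries \b ... \b, we use two negative lookaheads to check for both alpha and numeric characters. We need two negative lookaheads instead of one because a letter can appear before a digit or vice versa.
Note that .* in each lookahead scans the rest of the string, not just the current word, so this pattern relies on the alphanumeric words coming first, as in the demo. For a text such as 'CAT abc123 LABS' it returns only 'LABS', whereas the split/explode approach above handles any word order.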
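If the mixed words can appear anywhere in the text, a per-word variant of the regex may be safer. The following is a sketch, not the pattern from the original answer: it assumes words are separated by whitespace and uses \S* instead of .* so that each lookahead stays inside a single word. The second sample row is a hypothetical reordering of the demo data.

```python
import pandas as pd

df = pd.DataFrame({'Text': ['2fvRE-Ku89lkRVJ44QQFN ABACUS LABS, INC',
                            'CAT abc123 LABS, INC']})

# (?<!\S)               start of a whitespace-delimited word
# (?!\S*[A-Za-z]\S*\d)  reject words with a letter followed by a digit
# (?!\S*\d\S*[A-Za-z])  reject words with a digit followed by a letter
# \S+                   otherwise take the whole word
pattern = r'(?<!\S)(?!\S*[A-Za-z]\S*\d)(?!\S*\d\S*[A-Za-z])\S+'

out = df['Text'].str.findall(pattern).str.join(' ')
print(out.tolist())  # ['ABACUS LABS, INC', 'CAT LABS, INC']
```

Because every match is a single word, joining with ' ' rebuilds the sentence directly and no trailing .str.strip() is needed.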