2

I have this pandas DataFrame:

0  Tokens 
1: 'rice', 'XXX', '250g'
2: 'beer', 'XXX', '750cc'

All the tokens in a row (e.g. 'rice', 'XXX' and '250g') are in the same list of strings, and all the lists are in the same column.

I want to remove the digits, but because they are attached to other characters (as in '250g'), my attempt does not remove them.

I have tried this code:

def remove_digits(tokens):
    """
    Remove digits from a string
    """
    return [''.join([i for i in tokens if not i.isdigit()])]

df["Tokens"] = df.Tokens.apply(remove_digits)
df.head()

but it only joined the strings, and I clearly do not want to do that.

My desired output:

0  Tokens
1: 'rice', 'XXX', 'g'
2: 'beer', 'XXX', 'cc'
Alex
rnv86
  • What is `Tokens` here? Could you provide the sentences to construct the df? – Norhther Jul 11 '21 at 20:03
  • It is the column where my cleaned tokens are. – rnv86 Jul 11 '21 at 20:04
  • I think this answers your question by using regular expressions: https://stackoverflow.com/questions/40178364/using-regex-to-remove-digits-from-string – braulio Jul 11 '21 at 20:20
  • In your suggested solution, you are passing the list `Tokens` to your function, so `i` is a whole string. You then need to loop over each character in the string `i` before applying `isdigit()` – braulio Jul 11 '21 at 20:24

3 Answers

2

This is possible using pandas methods, which are vectorised and so usually more efficient than looping.

import pandas as pd

df = pd.DataFrame({"Tokens": [["rice", "XXX", "250g"], ["beer", "XXX", "750cc"]]})

col = "Tokens"
df[col] = (
    df[col]
    .explode()
    .str.replace(r"\d+", "", regex=True)
    .groupby(level=0)
    .agg(list)
)
#             Tokens
# 0   [rice, XXX, g]
# 1  [beer, XXX, cc]

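Here we use `explode()` to give each token its own row, `str.replace()` with the regex `\d+` to strip the digits, and `groupby(level=0).agg(list)` to collect the tokens back into one list per original row.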
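
The pipeline above can be inspected one step at a time. The following is a minimal sketch on the same two-row frame; the intermediate names `exploded`, `cleaned` and `regrouped` are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({"Tokens": [["rice", "XXX", "250g"], ["beer", "XXX", "750cc"]]})

# Step 1: one row per token; the original row label is repeated in the index.
exploded = df["Tokens"].explode()

# Step 2: strip every run of digits from each token.
cleaned = exploded.str.replace(r"\d+", "", regex=True)

# Step 3: group on the repeated index labels and rebuild one list per row.
regrouped = cleaned.groupby(level=0).agg(list)

print(exploded.index.tolist())  # [0, 0, 0, 1, 1, 1]
print(regrouped.tolist())       # [['rice', 'XXX', 'g'], ['beer', 'XXX', 'cc']]
```

The repeated index labels produced by `explode()` are what make `groupby(level=0)` able to put each token back in its original row.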

Alex
0

Here's a simple solution -

import pandas as pd

df = pd.DataFrame({'Tokens':[['rice', 'XXX', '250g'], 
                             ['beer', 'XXX', '750cc']]})

def remove_digits_from_string(s):
    return ''.join([x for x in s if not x.isdigit()])

def remove_digits(l):
    return [remove_digits_from_string(s) for s in l]

df["Tokens"] = df.Tokens.apply(remove_digits)
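
To see why the second level of iteration matters, here is a small sketch without pandas that compares the attempt from the question with the per-character version on a single row. The names `attempt` and `fixed` are only illustrative.

```python
tokens = ["rice", "XXX", "250g"]

# Iterating over the list yields whole tokens. "250g".isdigit() is False
# because of the "g", so no token is filtered out and join() glues them together.
attempt = ["".join([t for t in tokens if not t.isdigit()])]

# Iterating over the characters of each token filters the digits themselves.
fixed = ["".join(ch for ch in t if not ch.isdigit()) for t in tokens]

print(attempt)  # ['riceXXX250g']
print(fixed)    # ['rice', 'XXX', 'g']
```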

ShlomiF
0

You can use to_list + re.sub in order to update your original dataframe.

import re

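# enumerate() gives positions, so this assumes the default 0..n-1 index
for index, lst in enumerate(df['Tokens'].to_list()):
    lst = [re.sub(r'\d+', '', i) for i in lst]
    # .at sets a single cell, so the list is stored as one value
    df.at[index, 'Tokens'] = lst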

print(df)

Output:

    Tokens
0   [rice, XXX, g]
1   [beer, XXX, cc]
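
If the frame does not have the default integer index, the same `re.sub` idea can be applied without any index bookkeeping. This is a sketch under that assumption; the labels `"a"` and `"b"` and the name `digits` are hypothetical, chosen only for illustration.

```python
import re
import pandas as pd

# Hypothetical non-default index labels, to show that no positional lookup is needed.
df = pd.DataFrame(
    {"Tokens": [["rice", "XXX", "250g"], ["beer", "XXX", "750cc"]]},
    index=["a", "b"],
)

digits = re.compile(r"\d+")  # compiled once, reused for every token

# apply() calls the function once per cell and keeps each returned list as one value.
df["Tokens"] = df["Tokens"].apply(lambda lst: [digits.sub("", tok) for tok in lst])

print(df["Tokens"].tolist())  # [['rice', 'XXX', 'g'], ['beer', 'XXX', 'cc']]
```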
Carmoreno