
I have a text that is tokenized. Inside there are numbers like 1.2, 2.3, etc. I used the following code to remove them, but it does not work:

train_vs['doc_text'] = train_vs['doc_text'].apply(lambda x: [c for c in x if not c.isnumeric()])
train_vs['doc_text'] = train_vs['doc_text'].apply(lambda x: [c for c in x if not c.isdigit()])  
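
For example (a quick check in a plain Python session), both tests return False for a float-like token, so those tokens never get filtered:

>>> "1.2".isnumeric()
False
>>> "1.2".isdigit()
False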

Any help on how to remove these digits? Thanks

Talal Ghannam

1 Answer


.apply is a method of both pd.Series and pd.DataFrame, and you are calling it on a Series. The upshot is that every x your lambda receives is a single value of that Series. If each of those values really is a tokenized list, I'm not sure that's ideal.

Anyway, isdigit and isnumeric can't check for floats out of the box. A silly workaround could look like:

import pandas as pd

df = pd.DataFrame(
    {
        'smple': [
            ["12.34", "atrium"],
            ["12.34", "atrium"],
            ["election", "foible"],
            ['USA', "2131244213213"],
        ]
    }
)


# drop tokens that are purely numeric, or that become purely numeric once a single '.' is removed
df.smple.apply(
    lambda x: [c for c in x if not (c.isnumeric() or c.replace('.', '', 1).isdigit())]
)
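
The trick is that stripping a single '.' from a float-like token leaves only digits (a quick interactive check, not part of the original answer):

>>> "12.34".replace('.', '', 1)
'1234'
>>> "1234".isdigit()
True

so "12.34" gets filtered out, while "atrium" survives both tests.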

This thread would be helpful reading for you, I think.

Charles Landau
  • Hi, thanks a lot for your help. Unfortunately, your code worked for some numbers but not all of them; I still have numbers like 1.23, -3.45, etc., especially negative numbers. – Talal Ghannam Feb 03 '19 at 17:25
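
A possible follow-up, not from the original thread: isnumeric/isdigit-based checks miss signed tokens like "-3.45". A more general sketch is to try a float() conversion instead (is_number is a helper name introduced here for illustration; train_vs is the asker's DataFrame):

def is_number(token):
    # float() accepts "1.23", "-3.45", "+2e10", "inf", "nan", ...
    try:
        float(token)
        return True
    except ValueError:
        return False

# keep only tokens that do not parse as numbers
train_vs['doc_text'] = train_vs['doc_text'].apply(
    lambda tokens: [t for t in tokens if not is_number(t)]
)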