
I have a dataframe of various wines. I am trying to remove all punctuation, all words of four or fewer characters, and the words flavors, aromas, finish, and drink from the strings in the 'description' column. My code does not appear to work, and I have tried various permutations of it to no avail.

remove_list = ['[^\w\s]', '[\b(\w{1,4})\b]', 'flavors', 'aromas', 'finish', 'drink']

df11['description'].str.replace('|'.join(remove_list), '', regex=True)

Chris
  • Do you have some sample data, and what is your expected output? You can also read [How to create a Minimal, Reproducible Example](https://stackoverflow.com/help/minimal-reproducible-example). – 99_m4n Aug 03 '22 at 18:46
  • This looks right. For starters, I'd test each replacement pattern individually and see how it behaves. – rayad Aug 03 '22 at 18:58

1 Answer


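I think you are missing the r prefix (raw string) on your regex patterns. Without it, Python interprets '\b' as a backspace character before the regex engine ever sees it, so it never acts as a word boundary. The square brackets around your second pattern are also a problem: they turn it into a character class that matches any single word character instead of whole short words.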
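To make the raw-string point concrete, here is a minimal sketch using only the standard re module. The sample text and the variable names (text, bracketed, unbracketed) are made up for illustration and are not from the question.

```python
import re

# In a normal string literal, '\b' is the backspace character (ASCII 8).
assert '\b' == '\x08'

# In a raw string, r'\b' stays as a backslash followed by 'b',
# which the regex engine reads as a word boundary.
assert r'\b' == '\\b'

text = "fruity red wine"  # hypothetical sample text

# Square brackets create a character class, so this pattern matches
# any single word character (plus a few literal symbols), one at a time.
bracketed = re.sub(r'[\b(\w{1,4})\b]', '', text)

# Without the brackets, the pattern matches whole words of 1 to 4 characters.
unbracketed = re.sub(r'\b\w{1,4}\b', '', text)

print(repr(bracketed))    # only the two spaces survive
print(repr(unbracketed))  # 'fruity' survives, followed by two spaces
```

With the bracketed version every letter is stripped, which may be why the original attempt seemed to wipe out the descriptions rather than just the short words.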

try:

remove_list = [r'[^\w\s]', r'\b\w{1,4}\b', 'flavors', 'aromas', 'finish', 'drink']

To replicate everything:

import pandas as pd

# create sample data
data = {'description': ["I don't like this wine. And flavors are really bad."]}
df11 = pd.DataFrame(data)
print(df11)

# raw strings keep \b and \w intact; {1,4} matches words of four or fewer characters
remove_list = [r'[^\w\s]', r'\b\w{1,4}\b', 'flavors', 'aromas', 'finish', 'drink']

# replace returns a new Series, so assign the result back to the column
df11['description'] = df11['description'].replace('|'.join(remove_list), '', regex=True)
print(df11)

output is: [screenshot of the printed result; image not available]
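Since the screenshot is not available, here is a self-contained sketch of the whole cleanup that can be run directly. It assumes the goal stated in the question (drop words of four or fewer characters, hence {1,4}), and it adds a whitespace-collapsing step that is not part of the original answer; the names pattern and cleaned are illustrative.

```python
import pandas as pd

# Sample data taken from the replication code in this answer.
df11 = pd.DataFrame({
    'description': ["I don't like this wine. And flavors are really bad."]
})

# Raw strings so that \b and \w reach the regex engine unchanged.
remove_list = [r'[^\w\s]', r'\b\w{1,4}\b', 'flavors', 'aromas', 'finish', 'drink']
pattern = '|'.join(remove_list)

cleaned = (
    df11['description']
    .str.replace(pattern, '', regex=True)   # remove punctuation, short words, listed words
    .str.replace(r'\s+', ' ', regex=True)   # collapse the gaps left behind (added step)
    .str.strip()                            # trim leading and trailing spaces
)

# str.replace returns a new Series; assign it back to keep the result.
df11['description'] = cleaned
print(df11['description'].tolist())
```

Note that the literal words are matched anywhere, so 'drink' would also be removed from inside 'drinking'. If that is not wanted, wrapping the words as r'\b(?:flavors|aromas|finish|drink)\b' is one option.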

mmustafaicer
  • This could use more explanation as to _why_ you think this should work. – renefritze Aug 04 '22 at 07:59
  • Sure, let me paste the replication code I used when I tested this. I think the OP forgot to use a raw string (r prefix) to avoid escape characters. But the question should definitely be improved with sample input, the desired output, and a description of the problem. – mmustafaicer Aug 04 '22 at 14:01