
I have a function in Python that splits a sentence into words using a tokenizer. The problem is that when I run this function, the output is returned as one word with no spaces.

  • actual sentence:

'is lovin Picture2Life.com!!! Y all fun apps r for iphone and not blackberry??!! '

  • result:

'islovinpicturelifecomyallfunappsrforiphoneandnotblackberry'

whereas the result should keep the words separated, like this: is loving picture 2 life . com....

code:

from nltk.corpus import stopwords

ppt = '''...!@#$%^&*()....{}’‘ “”  “[]|._-`/?:;"'\,~12345678876543'''

# tokenize helper function
def text_process(raw_text):
    '''
    parameters:
    =========
    raw_text: text as input
    functions:
    ==========
    - remove all punctuation
    - remove all stop words
    - return a list of the cleaned text

    '''
    #check characters to see if they are in punctuation
    nopunc = [char for char in list(raw_text) if char not in ppt]

    
    
    # join the characters again to form the string
    nopunc = "".join(nopunc)
    
    # now just remove any stopwords
     
    words = [word for word in nopunc.lower().split() if word not in stopwords.words("english")]
    return words

ddt= data.text[2:3].apply(text_process)
print("example: {}".format(ddt))
  • Seems to be coming up often, you can read about quickly removing crud from strings using "translate": https://stackoverflow.com/questions/50444346/fast-punctuation-removal-with-pandas – cs95 Jul 14 '20 at 11:57
  • what about tokenize the sentence ? – Pyleb Pyl3b Jul 14 '20 at 12:43
  • https://stackoverflow.com/questions/48049087/nltk-based-text-processing-with-pandas/48049425#48049425 – cs95 Jul 14 '20 at 15:15

1 Answer


Well, in your first line

ppt = '''...!@#$%^&*()....{}’‘ “”  “[]|._-`/?:;"'\,~12345678876543'''

you include space characters inside the ‘ “” “ sequence, so the list comprehension strips whitespace (and therefore the spaces between words) along with the punctuation:

nopunc = [char for char in list(raw_text) if char not in ppt]
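A minimal sketch of the fix: keep whitespace characters explicitly when filtering, so `.split()` can still find word boundaries. The small `STOP` set here is a stand-in for `stopwords.words("english")` so the sketch runs without downloading NLTK data; swap the real list back in for actual use.

```python
# The asker's punctuation string (it still contains spaces; the fix below
# makes that harmless by keeping whitespace explicitly)
ppt = '''...!@#$%^&*()....{}’‘ “”  “[]|._-`/?:;"'\\,~12345678876543'''

# Tiny hypothetical stand-in for nltk.corpus.stopwords.words("english")
STOP = {"is", "all", "and", "for", "not", "y", "r"}

def text_process(raw_text):
    # keep a character if it is whitespace OR not punctuation,
    # so spaces survive the filter and .split() can separate words
    nopunc = "".join(ch for ch in raw_text if ch.isspace() or ch not in ppt)
    return [w for w in nopunc.lower().split() if w not in STOP]

print(text_process('is lovin Picture2Life.com!!! Y all fun apps r '
                   'for iphone and not blackberry??!! '))
# → ['lovin', 'picturelifecom', 'fun', 'apps', 'iphone', 'blackberry']

# str.translate (as suggested in the comments) does the same in one pass,
# as long as the space characters are dropped from the deletion table:
table = str.maketrans('', '', ppt.replace(' ', ''))
print('is lovin Picture2Life.com!!!'.translate(table).lower().split())
# → ['is', 'lovin', 'picturelifecom']
```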