
str.replace method not working within a user-defined function when mapped to a pandas Series, although it works on a normal function call. Why?

import re
import string

RE_EMOJI = re.compile('[\U00010000-\U0010ffff]', flags=re.UNICODE)

def clean_story(x):
    x = x.strip() # trim leading and trailing whitespace only
    x = x.replace('\r', '').replace('\n', '') # remove carriage returns and newlines
    translator = str.maketrans('', '', string.punctuation) # translation table that deletes punctuation
    x = x.translate(translator) # strip punctuation
    x = RE_EMOJI.sub('', x) # remove emojis (characters outside the BMP)
    x = re.sub(r'http\S+', '', x, flags=re.MULTILINE) # remove URLs starting with http
    x = re.sub(r'\(http\S+', '', x, flags=re.MULTILINE) # remove URLs starting with (http
    x = re.sub(r'@\S+\s', '', x, flags=re.MULTILINE) # remove words starting with @
    x = re.sub(' +', ' ', x) # collapse repeated spaces
    x = x.lower() # lower case
    return x

Here x is an example text:

x = " Overview \n Using \r Home #; automation: to- control applications \r is cool, but we can further leverage its power to simplify our daily tasks like checking gas, geysers, heaters, AC temperature etc. before leaving home. \n Scope \n The server (Raspberry pi 2 running on Windows 10) is running at home. Family members will stay connected through their smart devices like Mobile phones with Windows Operating systems or Android. The server will be connected to gas, heat and smoke detection sensors. And will push notifications to all the clients (family members), on any changes detected. The server will push notifications to all the clients using Microsoft Azure- Notification Hubs. \n Components and supplies Software and applications Summary \n "

Using this text as an example, calling clean_story(x) properly removes all "\n":

'overview using home automation to control applications is cool but we can further leverage its power to simplify our daily tasks like checking gas geysers heaters ac temperature etc before leaving home scope the server raspberry pi 2 running on windows 10 is running at home family members will stay connected through their smart devices like mobile phones with windows operating systems or android the server will be connected to gas heat and smoke detection sensors and will push notifications to all the clients family members on any changes detected the server will push notifications to all the clients using microsoft azure notification hubs components and supplies software and applications summary'

However, when I place the same text in a pandas DataFrame and map the function over the Series containing the text, all the methods seem to work except str.replace. I'm left with a bunch of "n" (note: each one is what remains of a "\n", because a later step removes punctuation, including the backslash).

df_story['clean_story'] = df_story['story'].map(clean_story)

Result:

'overview n using home automation to control applications is cool but we can further leverage its power to simplify our daily tasks like checking gas geysers heaters ac temperature etc before leaving home n scope n the server raspberry pi 2 running on windows 10 is running at home family members will stay connected through their smart devices like mobile phones with windows operating systems or android the server will be connected to gas heat and smoke detection sensors and will push notifications to all the clients family members on any changes detected the server will push notifications to all the clients using microsoft azure notification hubs n components and supplies software and applications summary n'
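
One way to tell whether a cell holds real line breaks or the two characters backslash and n is to inspect its repr and test for each form separately. The following is a minimal sketch, not the asker's data: `real`, `escaped`, and `describe` are hypothetical names made up for illustration.

```python
# Hypothetical cells: the first holds a real newline character, the second
# holds a backslash followed by the letter n (two separate characters).
real = " Overview \n Using "
escaped = r" Overview \n Using "

def describe(cell):
    """Report which kind of line-break text a string contains."""
    return {
        "real_newline": "\n" in cell,   # one character, U+000A
        "backslash_n": "\\n" in cell,   # two characters: backslash, n
        "length": len(cell),
    }

# repr shows a real newline as \n and a literal backslash as \\
print(repr(real), describe(real))
print(repr(escaped), describe(escaped))
```

The same check could be run on a single cell pulled out of the Series, for example `df_story['story'].iloc[0]`. If `backslash_n` is true, `x.replace('\n', '')` has nothing to match, and the punctuation step later deletes the backslash and leaves the stray "n".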

  • Your Pandas series doesn't have newlines in it, it has literal backslashes followed by letter n's. Since we can't see how you created the DataFrame, it's impossible to tell you why it's wrong, or how to fix it, but that's the problem, and you need to fix it at the source. If I had to take a wild guess, you're building the DataFrame by storing the `repr`s of some strings, rather than the strings themselves. – abarnert Jun 27 '18 at 22:47
  • Thanks for your response! Yes, I understand that the "\n" are literal backslash substrings. I tried following your train of thought by assigning x = str(x) at the start of the function and then doing the processing; however, I get the same undesired result: the method is not working. Also, I checked df.info, which shows: story 9998 non-null object. Is there some other test I should do? – Ryan Runchey Jun 27 '18 at 23:05
  • `x = str(x)` isn't going to do anything when `x` is a string. It's just the _wrong_ string. (If my guess is right: the `repr` of a string is another string, because the `repr` of _anything_ is a string.) – abarnert Jun 27 '18 at 23:17
  • As I said before, you clearly did something wrong in building the DataFrame, but nobody can debug that for you if you don't show us how you built the DataFrame. – abarnert Jun 27 '18 at 23:18
  • Resolved: using "\\n" instead of just "\n" in the mapped function removes the "\n" substrings in question. I don't understand why, though. `x = x.replace("\\r", "").replace("\\n", "")` – Ryan Runchey Jun 27 '18 at 23:29
  • You haven't really solved the problem, just worked around it. You expected actual newlines in your strings. But somehow, when you put them in your DataFrame, you got a backslash and an n in place of each newline. Rather than figure out how you broke your data, you've just added a workaround that deals with the breakage. That's rarely the right answer. For example, your "cleanup" might well silently turn, e.g., a `㐀` character into `u3400`, which you might not discover until weeks after you'd written and forgotten the code and generated gigabytes of output. – abarnert Jun 28 '18 at 00:22
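
The points made in the comments above can be sketched in a few lines. How the DataFrame was actually built is never shown, so the escaping step here is only an assumption (a `unicode_escape` round trip stands in for whatever happened upstream), and `original`, `escaped`, `strip_punct`, and `workaround` are hypothetical names.

```python
import string

original = "line one\nline two \u3400"   # real newline, real CJK character

# Assumption: the text was escaped somewhere upstream, for example like this.
# After this step the string contains backslashes, not control characters.
escaped = original.encode("unicode_escape").decode("ascii")

strip_punct = str.maketrans("", "", string.punctuation)

def workaround(text):
    """Remove backslash-r and backslash-n pairs, then strip punctuation."""
    return text.replace("\\r", "").replace("\\n", "").translate(strip_punct)

print(str(escaped) == escaped)   # True: str() on a str returns it unchanged
print("\n" in escaped)           # False: there is no real newline to replace
print(workaround(escaped))       # line oneline two u3400

# Undoing the escaping first recovers the intended text, if this is what happened.
restored = escaped.encode("ascii").decode("unicode_escape")
print(restored == original)      # True
```

The workaround handles the escaped line breaks but still leaves `u3400` where `㐀` should be, which is why finding and fixing the step that escaped the data is preferable to patching the cleanup function.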
