I have a pandas DataFrame with a long-text column called description. The data comes from a Jira web instance. I've been trying to strip the markup from the text using several different methods, but none of them removes the \r\n\xa0 sequences.

Here's what I have so far:

        df['description'] = df['description'].replace(r'http\S+', '', regex=True).replace(r'www\S+', '', regex=True)
        df['description'] = df['description'].replace(r'[^\x00-\x7F]+', ' ', regex=True)
        df['description'] = df['description'].replace(r'\[(.+)\]\([^\)]+\)', r'\1', regex=True).replace(r'\*\*([^*]+)\*\*', r'\1', regex=True)
        df['description'] = df['description'].replace(r'\*([^*]+)\*', r'\1', regex=True)
        df['description'] = df['description'].astype(str).str.strip()

Any ideas what I can do here? Here is a sample of the text:

We analyzed found the issue in Garbage Collection which crashed the JVM.\r\n\r\n\xa0\r\n\r\n\xa0\r\n\r\n_Stack: [0x00007f0b58ff1000,0x00007f0b590f1000],\xa0 sp=0x00007f0b590ef120,\xa0 free space=1016k_\r\n\r\n_Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)_\r\n\r\n_V\xa0 [libjvm.so+0x8b9e4f]\xa0 MethodData::clean_extra_data(BoolObjectClosure)+0x1cf_\r\n\r\n_V\xa0 [libjvm.so+0x63c582]\xa0 
TirzaRuth
  • https://stackoverflow.com/questions/10993612/python-removing-xa0-from-string – Eric Truett Apr 22 '20 at 00:06
  • Could you please provide a sample of the text? – wwnde Apr 22 '20 at 00:17
  • Added a sample of the text. – TirzaRuth Apr 22 '20 at 00:29
  • Hi, what would the expected output be for that sample? – EvensF Apr 22 '20 at 02:40
  • We analyzed the JAVA heap dump and found the issue in Garbage Collection which crashed the JVM. Stack: [0x00007f0b58ff1000,0x00007f0b590f1000], sp=0x00007f0b590ef120, free space=1016k Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) V [libjvm.so+0x8b9e4f] MethodData::clean_extra_data(BoolObjectClosure*)+0x1cf – TirzaRuth Apr 22 '20 at 11:25

1 Answer

This should match them if your text contains the actual control characters (carriage return, newline, non-breaking space):

pattern = r'(\r)|(\n)|(\xa0)'

Otherwise, if the text contains the backslash sequences as literal characters, use this:

pattern = r'(\\r)|(\\n)|(\\xa0)'
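
To decide which of the two patterns applies, it can help to check what the strings actually contain. Below is a minimal sketch using only the standard `re` module. The two sample strings and all variable names are made up for illustration and are not taken from the asker's data:

```python
import re

# Hypothetical samples (not from the original data):
real = "JVM.\r\n\xa0Stack"        # actual CR, LF and non-breaking space characters
literal = r"JVM.\r\n\xa0Stack"    # the same sequences stored as plain backslash text

pattern_chars = r'(\r)|(\n)|(\xa0)'     # first pattern: matches the characters
pattern_text = r'(\\r)|(\\n)|(\\xa0)'   # second pattern: matches the literal text

# Each pattern only changes the string form it was written for.
chars_on_real = re.sub(pattern_chars, '', real)
chars_on_literal = re.sub(pattern_chars, '', literal)
text_on_real = re.sub(pattern_text, '', real)
text_on_literal = re.sub(pattern_text, '', literal)

# repr() makes the difference visible when inspecting a value by hand.
print(repr(real))
print(repr(literal))
```

If `repr()` of a value shows doubled backslashes, the sequences are literal text and the second pattern is the one to try.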
Ramin Melikov
  • Thanks, I tried both of the patterns you suggested above, but it's still not being removed. – TirzaRuth Apr 22 '20 at 11:26
  • I tried copying the text into another dataframe, and when I apply your 2nd pattern there, it removes the sequences completely. But I also noticed that when I copy this text, it does not come up with the \xa0 in it. Is this something to do with the xa0? My original text comes from a web application that I am downloading through a REST API. – TirzaRuth Apr 22 '20 at 12:40
  • @TirzaRuth Also, I think between `df['description']` and `.replace()` you should insert `.str`. – Ramin Melikov Apr 23 '20 at 23:24
  • Yes, that was it: it needed to be a string first. Works now, thanks! – TirzaRuth Apr 24 '20 at 13:13
  • @TirzaRuth You're welcome. I'd appreciate an upvote and the best-answer pick. – Ramin Melikov Apr 25 '20 at 00:38
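
The fix the comments arrive at (convert to string, then go through `.str`) might look roughly like the sketch below. The sample rows, the combined pattern, and the name `cleaned` are assumptions made for illustration; the thread does not show the final code that worked. `regex=True` is passed explicitly because the default for `Series.str.replace` differs between pandas versions.

```python
import pandas as pd

# Hypothetical rows: the first holds real control characters,
# the second holds the same sequences as literal backslash text.
df = pd.DataFrame({
    "description": [
        "crashed the JVM.\r\n\r\n\xa0\r\n\r\nStack: [0x1],\xa0 sp=0x2",
        r"crashed the JVM.\r\n\r\n\xa0\r\n\r\nStack: [0x1],\xa0 sp=0x2",
    ]
})

# One pattern covering both forms: the literal text "\r", "\n", "\xa0",
# or the actual CR, LF and non-breaking space characters.
pattern = r"\\r|\\n|\\xa0|[\r\n\xa0]"

cleaned = (
    df["description"]
    .astype(str)                             # make sure every value is a str
    .str.replace(pattern, " ", regex=True)   # swap each match for a space
    .str.replace(r"\s+", " ", regex=True)    # collapse runs of whitespace
    .str.strip()                             # trim leading/trailing whitespace
)

print(cleaned.tolist())
```

Note that the literal-text half of the pattern would also alter legitimate backslash sequences such as a Windows path containing `\n`, so it is worth dropping that half if the data only holds real control characters.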