6

I have a string containing a sentence like: I don't want it, there'll be others

But the text looks like this: I don\'t want it, there\'ll be other

For some reason a \ appears in the text before each '. The text was read in from another source. I want to remove the backslash, but can't. I've tried:

sentence.replace("\'","'")

sentence.replace(r"\'","'")

sentence.replace("\\","")

sentence.replace(r"\\","")

sentence.replace(r"\\\\","")

I know the \ is used to escape something, so I'm not sure how to do this with the quotes.

jason

4 Answers

9

The \ is just there to escape the ' character. It is only visible in the representation (repr) of the string; it is not actually a character in the string. See the following demo:

>>> repr("I don't want it, there'll be others")
'"I don\'t want it, there\'ll be others"'

>>> print("I don't want it, there'll be others")
I don't want it, there'll be others
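
One way to check this for yourself is to test whether the string really contains a backslash character. This is only a small sketch; the sample string `s` is made up for illustration and contains both quote types so that repr has to escape one of them.

```python
# Hypothetical sample: contains both ' and ", so repr() must escape the '.
s = 'He said "don\'t"'

shown = repr(s)  # what the interactive prompt would echo back

print(shown)       # the echoed form, with a backslash before the '
print(s)           # the actual text, without any backslash
print("\\" in s)   # does the string itself hold a backslash character?
```

If the last line prints False, there is nothing for replace() to remove.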
Cory Kramer
  • this doesn't help me, because I feed the string through `nltk` and it thinks `don` is a separate word, cutting off the word `don't` – jason Oct 16 '15 at 11:56
  • i think this is a `nltk` problem then, thanks for the help – jason Oct 16 '15 at 12:06
  • It's not an nltk "problem". The backslashes are how Python is showing you that the string doesn't end at the apostrophe, as everyone has said. The usual NLTK tokenization intentionally breaks up words at the apostrophe; this has nothing to do with the backslashes. – alexis Oct 16 '15 at 20:49
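
To illustrate the last comment without depending on NLTK itself, here is a rough sketch using only the standard `re` module. The two patterns are assumptions chosen for illustration, not what NLTK does internally; they only show that where a word gets split is decided by the tokenizer's rules, not by any backslash.

```python
import re

text = "I don't want it, there'll be others"

# Pattern that allows an apostrophe inside a word (illustrative choice).
contraction_tokens = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)

# Plain \w+ stops at the apostrophe, so contractions get cut in two.
word_tokens = re.findall(r"\w+", text)

print(contraction_tokens)
print(word_tokens)
```

If contractions need to stay whole, it may be worth looking at which tokenizer is being used rather than at the string's escaping.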
2

Try to use:

sentence.replace("\\", "")

You need two backslashes because the first one acts as the escape symbol, and the second is the character that you want to replace.
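
A minimal sketch of this, assuming the sentence really does contain backslash characters (the raw literal below is a made-up stand-in for the text read from the other source). Note that strings are immutable, so replace() returns a new string and the result has to be assigned.

```python
# Raw literal: the backslashes here are real characters in the string.
sentence = r"I don\'t want it, there\'ll be other"

# "\\" in source code is a single backslash character.
print(len("\\"))

# replace() does not modify sentence in place; keep the returned value.
cleaned = sentence.replace("\\", "")

print(sentence)
print(cleaned)
```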

Eugene Soldatov
1

It is better to use a regular expression to remove the backslash:

>>> import re
>>> re.sub(r"\\'", "'", r"I don\'t want it, there\'ll be other")
"I don't want it, there'll be other"
Mayur Koshti
0

If your text comes from crawled web pages and you didn't unescape it before processing it with NLP tools, you can easily unescape the HTML entities, e.g.:

In Python 2.x:

>>> import sys; sys.version
'2.7.6 (default, Jun 22 2015, 17:58:13) \n[GCC 4.8.2]'
>>> import HTMLParser
>>> txt = """I don&#39;t want it, there&#39;ll be other"""
>>> HTMLParser.HTMLParser().unescape(txt)
u"I don't want it, there'll be other"

In Python 3:

>>> import sys; sys.version
'3.4.0 (default, Jun 19 2015, 14:20:21) \n[GCC 4.8.2]'
>>> import html
>>> txt = """I don&#39;t want it, there&#39;ll be other"""
>>> html.unescape(txt)
"I don't want it, there'll be other"

See also: How do I unescape HTML entities in a string in Python 3.1?

alvas