
I'm having trouble getting a replace() to work

I've tried my_string.replace('\\', '') and re.sub('\\', '', my_string), but neither one works.

I thought \\ was the escape code for backslash, am I wrong?

The string in question looks like

'<2011315123.04C6DACE618A7C2763810@\x82\xb1\x82\xea\x82\xa9\x82\xe7\x8c\xa9\x82\xa6\x82\xe9\x82\xbe\x82\xeb\x82\xa4>'

or, when printed (print my_string):

<2011315123.04C6DACE618A7C2763810@???ꂩ?猩???邾?낤>

Yes, it's supposed to look like garbage, but I'd rather get '<2011315123.04C6DACE618A7C2763810@82b182ea82a982e78ca982a682e982be82eb82a4>'

Gilles 'SO- stop being evil'
Joshua Olson
  • Related: http://stackoverflow.com/questions/92438/stripping-non-printable-characters-from-a-string-in-python – icktoofay Apr 24 '11 at 00:56
  • That doesn't really help. I want my string to only contain ASCII characters, but I don't want to completely strip out the non-ASCII characters, just make them ASCII literals. – Joshua Olson Apr 24 '11 at 01:04
  • I want the ASCII because it GREATLY simplifies the regex search string I can use. I can check for \@[\w\.]+\ and be done with it, because I know that if I get a ']', '>', ' ' or anything of the sort, my domain name is finished. – Joshua Olson Apr 25 '11 at 07:54

2 Answers


You don't have any backslashes in your string. What you don't have, you can't remove.

Consider what you are showing as '\x82' ... this is a one-byte string.

>>> s = '\x82'
>>> len(s)
1
>>> ord(s)
130
>>> hex(ord(s))
'0x82'
>>> print s
é # my sys.stdout.encoding is 'cp850'
>>> print repr(s)
'\x82'
>>>

What you'd "rather get" ('x82') is meaningless.
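To make the point concrete, a quick check on the same kind of string shows the replace genuinely has nothing to do:

>>> s = '\x82\xb1\x82\xea'
>>> '\\' in s          # no backslash byte (0x5c) anywhere in the string
False
>>> s.replace('\\', '') == s   # so the replace is a no-op
True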

Update The "non-ascii" part of the string (bounded by @ and >) is actually Japanese text written mostly in Hiragana and encoded using shift_jis. Transcript of IDLE session:

>>> y = '\x82\xb1\x82\xea\x82\xa9\x82\xe7\x8c\xa9\x82\xa6\x82\xe9\x82\xbe\x82\xeb\x82\xa4'
>>> print y.decode('shift_jis')
これから見えるだろう

Google Translate produces "Can not you see the future" as the English translation.

In a comment on another answer, you say:

I just need ascii

and

What I'm doing with it is seeing how far apart the two strings are using nltk.edit_distance(), so this will give me a multiple of the true distance. Which is good enough for me.

Why do you think you need ASCII? Edit distance is defined quite independently of any alphabet.

For a start, doing nonsensical transformations of your strings won't give you a consistent or predictable multiple of the true distance. Secondly, out of the following:

x
repr(x)
repr(x).replace('\\', '')
repr(x).replace('\\x', '') # if \ is noise, so is x
x.decode(whatever_the_encoding_is)

why do you choose the third?
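For concreteness, here is what each of those five options yields for the first two Japanese characters of the string in question (a sketch, in the same transcript style as the IDLE sessions above):

>>> x = '\x82\xb1\x82\xea'       # first two Japanese characters
>>> x                            # option 1: the raw bytes
'\x82\xb1\x82\xea'
>>> repr(x)                      # option 2
"'\\x82\\xb1\\x82\\xea'"
>>> repr(x).replace('\\', '')    # option 3
"'x82xb1x82xea'"
>>> repr(x).replace('\\x', '')   # option 4
"'82b182ea'"
>>> print x.decode('shift_jis')  # option 5: decode properly
これ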

Update 2 in response to comments:

(1) You still haven't said why you think you need "ascii". nltk.edit_distance doesn't require "ascii" -- the args are said to be "strings" (whatever that means) but the code will work with any 2 sequences of objects for which != works. In other words, why not just use the first of the above 5 options?

(2) Accepting up to 100% inflation of the edit distance is somewhat astonishing. Note that your currently chosen method will use 4 symbols (hex digits) per Japanese character. repr(x) uses 8 symbols per character. x (the first option) uses 2.

(3) You can mitigate the inflation effect by normalising your edit distance. Instead of comparing distance(s1, s2) with a number_of_symbols threshold, compare distance(s1, s2) / float(max(len(s1), len(s2))) with a fraction threshold. Note normalisation is usually used anyway ... the rationale being that the dissimilarity between 20-symbol strings with an edit distance of 4 is about the same as that between 10-symbol strings with an edit distance of 2, not twice as much.
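For example, a minimal sketch of that normalisation (the 0.3 threshold is purely illustrative):

>>> import nltk
>>> def normalised_distance(s1, s2):
...     # scale the raw edit distance by the longer string's length
...     return nltk.edit_distance(s1, s2) / float(max(len(s1), len(s2)))
...
>>> normalised_distance('kitten', 'sitting')   # raw distance 3, max length 7
0.42857142857142855
>>> normalised_distance('kitten', 'sitting') < 0.3
False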

(4) nltk.edit_distance is the most shockingly inefficient pure-Python implementation of edit_distance that I've ever seen. This implementation by Magnus Lie Hetland is much better, but still capable of improvement.
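For reference, a sketch of the usual textbook (Wagner-Fischer) dynamic-programming approach; this is not Hetland's code, just the standard technique, keeping only one previous row in memory:

>>> def edit_distance(a, b):
...     prev = range(len(b) + 1)    # distances from '' to each prefix of b
...     for i, ca in enumerate(a, 1):
...         curr = [i]              # distance from a[:i] to ''
...         for j, cb in enumerate(b, 1):
...             cost = 0 if ca == cb else 1
...             curr.append(min(prev[j] + 1,          # deletion
...                             curr[j - 1] + 1,      # insertion
...                             prev[j - 1] + cost))  # substitution
...         prev = curr
...     return prev[-1]
...
>>> edit_distance('kitten', 'sitting')
3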

John Machin
  • Yeah, I figured that out after pulling it up in a texteditor. I was getting repr and print representations of the character. Thanks. – Joshua Olson Apr 24 '11 at 01:12
  • @Joshua Olson: The first edition of my answer answered your question correctly. The fact that you want to do something else has nothing to do with whether you should accept my answer. – John Machin Apr 24 '11 at 03:00
  • The problem is I don't know what the encoding is (spam messages, which are the source of the strings, often aren't well formed) and I need some representation of them to compare in edit_distance. (Yes, the x is garbage too; I ended up stripping both the \ and x out, keeping just the hex of the letter, your 4th example.) If I have a string of hex numbers I can compare their distance just as well as if I were using the decoded string. If you know of a way of identifying the encoding based on a handful of characters that's as straightforward as `repr(x).replace('\\x', '')` then I'd use it. – Joshua Olson Apr 24 '11 at 04:34
  • I've accepted your answer since it now covers the explanation and what I was looking for. I wish there were a better solution, but without knowing the encoding I'm stuck with doing it this way. Some of my data doesn't even have a domain name and that's causing me all kinds of other headaches as far as how to handle it without throwing my numbers off completely. – Joshua Olson Apr 24 '11 at 04:49
  • Taking the hex values (minus the \x) of the characters should give me an edit distance between 1.0 and 2.0 times the true edit distance, especially when both strings are transformed in this way. Yes, using '\\' instead of '\\x' wouldn't make as much sense, but it wouldn't do much harm either, since both strings would be transformed in the same way. – Joshua Olson Apr 24 '11 at 04:59
  • @Joshua Olson: In general, guessing the encoding would be expensive and unreliable, given the shortness of the strings. `chardet` works well when presented with UTF-8, with encodings used with the Cyrillic script, and with encodings used with Chinese, Japanese, and Korean -- it identifies your guff as shift_jis -- but coverage is otherwise patchy. For other issues, see my updated answer. – John Machin Apr 24 '11 at 22:51
  • Basically it comes down to the fact that this calculation is a VERY small part of my project and it isn't worth spending a lot of time to solve efficiently and accurately. I could spend hours researching how to build a much more complex regex search string to deal with all possible variations of domain names (rather than just [\w\.]) and implementing a more efficient version of edit_distance (while I'm at it, why not implement Damerau–Levenshtein distance), or I could spend ten minutes to do it this way and use my time doing more analysis on the data. – Joshua Olson Apr 25 '11 at 07:48
  • Also, the edit_distance is used in a ratio against the length of the "domain name", so the inflation is neutralized by that fact. If the edit distance is 100% inflated then so is the length of the domain name. It all comes down to practicality of the solution. And it just wasn't WORTH spending the time to do it more efficiently for something that will get used/calculated MAYBE 700 times per batch (2+ weeks). – Joshua Olson Apr 25 '11 at 07:50

This works, I think, if you really want to just strip the "\":

>>> a = '<2011315123.04C6DACE618A7C2763810@\x82\xb1\x82\xea\x82\xa9\x82\xe7\x8c\xa9\x82\xa6\x82\xe9\x82\xbe\x82\xeb\x82\xa4>'
>>> repr(a).replace("\\","")[1:-1]
'<2011315123.04C6DACE618A7C2763810@x82xb1x82xeax82xa9x82xe7x8cxa9x82xa6x82xe9x82xbex82xebx82xa4>'
>>> 

But like the answer above, what you get is pretty much meaningless.
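A variation on the same idea, if you'd rather leave the readable ASCII part untouched and hex-ify only the high bytes (the helper name is mine; this happens to produce exactly the output the question asked for):

>>> def hexify_nonascii(s):
...     # keep ASCII bytes as-is, replace each high byte with two hex digits
...     return ''.join(c if ord(c) < 128 else '%02x' % ord(c) for c in s)
...
>>> hexify_nonascii(a)
'<2011315123.04C6DACE618A7C2763810@82b182ea82a982e78ca982a682e982be82eb82a4>'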

dting
  • Well, sometimes there is a good reason someone wants to do something that I can't come up with. I just offered a solution with a warning... – dting Apr 24 '11 at 01:13
  • Wait. That might be the exact solution I'm looking for. I know it's nonsense, but I just need ascii that I can parse in a consistent way with another part of the same string (From and Message-ID fields of spam messages). What I'm doing with it is seeing how far apart the two strings are using nltk.edit_distance(), so this will give me a multiple of the true distance. Which is good enough for me. – Joshua Olson Apr 24 '11 at 01:35
  • What is "this" that amuses you so much? – John Machin Apr 24 '11 at 02:54