How can I split a line in Python at a non-printing ASCII character (such as the long minus sign, hex 0x97, octal 227)? I won't need the character itself. The information after it will be saved as a variable.
3 Answers
You can use re.split.
>>> import re
>>> re.split('\W+', 'Words, words, words.')
['Words', 'words', 'words', '']
Adjust the pattern to only include the characters you want to keep.
See also: stripping-non-printable-characters-from-a-string-in-python
Example (w/ the long minus):
>>> # \xe2\x80\x93 represents a long dash (or long minus)
>>> s = 'hello – world'
>>> s
'hello \xe2\x80\x93 world'
>>> import re
>>> re.split("\xe2\x80\x93", s)
['hello ', ' world']
Or, the same with unicode:
>>> # \u2013 represents a long dash, long minus or so called en-dash
>>> s = u'hello – world'
>>> s
u'hello \u2013 world'
>>> import re
>>> re.split(u"\u2013", s)
[u'hello ', u' world']
-
How do I specify that I want to split exactly at hex character 97? – d-cubed May 29 '10 at 18:46
-
-1 (0) The OP has an EM DASH (U+2014, cp1252 0x97), not an EN DASH (U+2013, cp1252 0x96). (1) Your second example is in terms of UTF-8 which obviously (??) the OP is not using (2) Using re.split instead of str.split is gross overkill. – John Machin May 30 '10 at 22:06
_, _, your_result= your_input_string.partition('\x97')
or
your_result= your_input_string.partition('\x97')[2]
If your_input_string does not contain a '\x97', then your_result will be empty. If your_input_string contains multiple '\x97' characters, your_result will contain everything after the first '\x97' character, including other '\x97' characters.
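The three cases described above can be checked with a short sketch. It is written for Python 3, where the input is assumed to be bytes read from a cp1252-encoded file; the helper name text_after and the sample byte strings are made up for illustration.

```python
SEP = b"\x97"  # the cp1252 byte from the question (assumed input is bytes)

def text_after(line, sep=SEP):
    """Return everything after the first sep, or an empty value if sep is absent."""
    # partition() always returns a 3-tuple: (head, separator, tail)
    return line.partition(sep)[2]

one_sep = text_after(b"key\x97value")        # exactly one separator
missing = text_after(b"no separator here")   # separator absent
many = text_after(b"a\x97b\x97c")            # only the first separator is used

print(one_sep, missing, many)
```

In Python 2 the same calls work on plain str objects, as in the answer above.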

Just use the string/unicode split method. It doesn't really care about the string you split on (other than that it is a constant). If you want to use a regex, use re.split.
To get the split string, either escape it as the other answers have shown ("\x97"), or use chr(0x97) for strings (0-255) or unichr(0x97) for unicode.
An example would be:
'will not be split'.split(chr(0x97))
'will be split here:\x97 and this is the second string'.split(chr(0x97))
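For reference, here is roughly what those two calls return, as a sketch run under Python 3. There chr() returns a text string and unichr() no longer exists, so this only illustrates the split behaviour, not the byte-versus-unicode distinction of Python 2. The variable names are made up for illustration.

```python
sep = chr(0x97)
same_as_escape = (sep == "\x97")  # chr(0x97) and the escape are the same string

# No separator present: split() returns a one-element list.
unsplit = "will not be split".split(sep)

# Separator present: the separator itself is dropped from the result.
parts = "will be split here:\x97 and this is the second string".split(sep)

# An optional maxsplit argument stops after the first separator.
first_only = "a\x97b\x97c".split(sep, 1)

print(unsplit)
print(parts)
print(first_only)
```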

-
(0) You mean str/unicode split method (1) "other than it is a constant": It can be any expression that evaluates to a single string (like, for example, `chr(0x97)`) (2) using `[uni]chr(0x97) instead of [u]"\x97"` is obfuscatory/redundant/wasteful/deprecable (IMHO) -- would you write `float("1.23")` instead of `1.23`?? (3) If operating in unicode, he wouldn't need `unichr(0x97)`, he would need `u"\u2014"`, which is `"\x97".decode("cp1252")` – John Machin May 30 '10 at 21:58
-
(0) In my *english* explanation do I really have to specify that it is the *str* method rather than a method that operates on a string... which **is** the str class??? (1) "it is a constant" was referring to the fact that the string couldn't specify more than one string (chr(0x97) will always be '\x97'), whereas an re.split could handle '\x97|\x91'. **OF COURSE** you could write chr(i) where i is a variable which can change. (2) Yes... of course you wouldn't do a float conversion, but chr may be useful if he needed to convert a number into a string **at runtime**. – Terence Honles May 30 '10 at 22:45
-
(3) And no I didn't check what 0x97 was in unicode... why should I? he asked for 0x97... I gave that to him. It's up to him to figure out that character hex values in ASCII are different than in unicode (I was merely showing that there *was* an equivalent that would generate a unicode character string) – Terence Honles May 30 '10 at 22:46
-
(0) a string is an instance of the str type OR the unicode type (1) "constant" != "only one string" (3) You shouldn't need to "check what 0x97 was in unicode" ... characters in the range U+0080 to U+009F are C1 control characters, nothing to do with dashes. If you have them in your unicode data, you are either working with some ancient/arcane protocol (prob=0.001) or some wally has decoded using latin1 instead of cp1252 (prob=0.999). The first 128 Unicode characters were deliberately made same as ASCII; "character hex values in ASCII" are NOT "different than in unicode". 0x97 isn't in ASCII. – John Machin May 31 '10 at 01:24
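The cp1252-versus-latin1 point can be verified with a small sketch (Python 3 syntax, standard library only; the variable names are illustrative). It decodes the same byte both ways and asks unicodedata what the results are.

```python
import unicodedata

raw = b"\x97"  # the byte from the question

as_cp1252 = raw.decode("cp1252")    # the dash the OP actually has
as_latin1 = raw.decode("latin-1")   # what a wrong decode produces

dash_name = unicodedata.name(as_cp1252)          # 'EM DASH'
latin1_category = unicodedata.category(as_latin1)  # 'Cc' = control character

# cp1252 0x96 is the other dash mentioned earlier in the thread.
en_dash_name = unicodedata.name(b"\x96".decode("cp1252"))  # 'EN DASH'

# ASCII only covers 0x00-0x7F, so 0x97 is outside it.
in_ascii = raw[0] < 0x80

print(dash_name, latin1_category, en_dash_name, in_ascii)
```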
-
I still think you are a *little* too picky about the string thing, and I was a *little* wary about putting "constant" when I wrote it (I thought with context it was obvious that it was not an re). Well thank you for your knowledge about unicode... I haven't really used it much. And finally I was afraid you were going to bring that up (I was not 100% sure about the mapping of characters from ASCII to unicode)... and about 0x97, it **is** if you are using extended ASCII (which I was including when I wrote the comment because I had already written over one comment worth) – Terence Honles May 31 '10 at 03:19
-
Unicode knowledge: you may like to read http://www.amk.ca/python/howto/unicode and the references (especially the articles by Czyborra, Spolsky and Orendorff). "extended ASCII" is not a very technical description and is meaningless without specifying somehow (e.g. by naming an encoding) what codepoints 0x80 to 0xFF mean. I don't understand your reason for using "ASCII" when you meant "extended ASCII" ("because I had already written over one comment worth"). – John Machin May 31 '10 at 05:57
-
I apologize if I described the character wrong. cat -e showed 'M-^W' which gave the octal value of 227 which googling I found was equivalent to: U+0097, character —, decimal 151, hex 0x97, octal \227, binary 10010111 – d-cubed May 31 '10 at 18:40
-
@Donnied: You didn't describe the character wrongly; you gave enough info (long minus sign). However you are NOW going wrong; U+0097 is (as I wrote above) a control character, not a dash/minus. cat -e? What's that? In the Python 2.x context, `print repr(your_data)` shows unambiguously and portably what you have; try using it when you ask your next question. – John Machin Jun 01 '10 at 01:18
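A sketch of that inspection step, assuming Python 3 rather than the Python 2.x `print repr(your_data)` mentioned above. The sample bytes are hypothetical, and cp1252 is assumed as the file's encoding.

```python
raw_line = b"price\x97 12"           # hypothetical bytes as read from a file
decoded = raw_line.decode("cp1252")  # assumes the data is cp1252

# repr() of bytes shows non-ASCII bytes as \xNN escapes.
raw_view = repr(raw_line)

# ascii() does the same for text, using \uNNNN escapes.
text_view = ascii(decoded)

print(raw_view)   # b'price\x97 12'
print(text_view)  # 'price\u2014 12'
```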
-
"cat -e" is a Linux command that I was using to see what was getting munged. The "U+0097" bit was what googling for the significance of octal \227. u0097 is a control character - why was it listed as an equivalent? Thanks for the heads up. – d-cubed Jun 01 '10 at 02:00
-
"""u0097 is a control character - why was it listed as an equivalent?""" Please don't be shocked: Some people who write articles that you can find with google are a few sandwiches short of a picnic :-) See my comments on "extended ASCII" and encoding above. Also read the articles that I recommended to Terence. – John Machin Jun 02 '10 at 04:10