4

I am trying to split a string in Python into a list of characters. I know there are many ways to do this in Python, but I have a case where those methods don't give me the desired result.

The problem occurs when the string contains a special character such as '\t' that is written out explicitly, as a backslash followed by a t (I don't mean a real tab).

Example:

string = "    Hello \t World."

The output I need is:

list_of_chars = [' ', ' ', ' ', ' ', 'H', 'e', 'l', 'l', 'o', ' ', '\', 't', ' ', 'W', 'o', 'r', 'l', 'd', '.']

But when I use the methods that are given in this question, I get a list that contains '\t' as a single element, not separated into two characters.

Example:

> list(string)
> [' ', ' ', ' ', ' ', 'H', 'e', 'l', 'l', 'o', ' ', '\t', ' ', 'W', 'o', 'r', 'l', 'd', '.']

I want to know why this happens and how to get what I want.

Moinuddin Quadri
atefsawaed
  • `string = [x for x in r" Hello \t World."]` is the closest you'll get. – Abdou Jan 01 '18 at 19:42
  • The output you need is impossible. That's a syntax error. – Stefan Pochmann Jan 01 '18 at 19:43
  • If you typed `" Hello \t World."` in Python, this is the real tab. A string containing backslash-t would be either `r" Hello \t World."` or `" Hello \\t World."`. Do you have the string in the code or are you reading a file?... – dividebyzero Jan 01 '18 at 19:46
  • @dividebyzero I am reading a source code file of a high-level-like language, and it has strings like the one I mentioned. – atefsawaed Jan 01 '18 at 19:51

4 Answers

6

You can substitute your string accordingly:

import itertools
txt = "    Hello \t World."

specials = { 
    '\a' : '\\a', #     ASCII Bell (BEL)
    '\b' : '\\b', #     ASCII Backspace (BS)
    '\f' : '\\f', #     ASCII Formfeed (FF)
    '\n' : '\\n', #     ASCII Linefeed (LF)
    '\r' : '\\r', #     ASCII Carriage Return (CR)
    '\t' : '\\t', #     ASCII Horizontal Tab (TAB)
    '\v' : '\\v'  #     ASCII Vertical Tab (VT)
}

# edited out: # txt2 = "".join([x if x not in specials else specials[x] for x in txt])
txt2 = itertools.chain(* [(list(specials[x]) if x in specials else [x]) for x in txt])

print(list(txt2))

Output:

[' ', ' ', ' ', ' ', 'H', 'e', 'l', 'l', 'o', ' ', '\\', 't', ' ', 'W', 
 'o', 'r', 'l', 'd', '.'] 

The list comprehension now tests the "positive" case (x in specials) and builds the result with list(itertools.chain(*[...])) instead of list("".join([...])), which should be more performant, although I have not measured it.
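As the comment below mentions translate, here is a minimal sketch of the same substitution done with str.maketrans and str.translate. The names escape_table and split_escaped are made up for illustration, and the mapping is the same one used above, so only those seven control characters are handled.

```python
# Same mapping as above: each control character becomes its
# two-character escaped spelling (backslash + letter).
specials = {
    '\a': '\\a', '\b': '\\b', '\f': '\\f', '\n': '\\n',
    '\r': '\\r', '\t': '\\t', '\v': '\\v',
}

# str.maketrans accepts a dict whose keys are single characters;
# the values may be strings of any length.
escape_table = str.maketrans(specials)


def split_escaped(text):
    """Hypothetical helper: list the characters of text with the
    control characters above spelled out as backslash + letter."""
    return list(text.translate(escape_table))


chars = split_escaped("    Hello \t World.")
print(chars)
```

This keeps the substitution in one call and avoids building a temporary list per character, but it is only a sketch under the assumption that the seven escapes above are the only ones that matter.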

Patrick Artner
  • @vaultah Cool. I only knew about `[key]` and `.keys`; a `.get()` with a default is something I should have guessed would be provided. Found it here: https://docs.python.org/3/library/stdtypes.html#dict. I read about the ["flatten inner lists"](https://stackoverflow.com/questions/952914/making-a-flat-list-out-of-list-of-lists-in-python) syntax but can't get my head around it. Thanks for commenting, gonna read up on translate next. – Patrick Artner Jan 02 '18 at 09:10
5

You should take a look at the String Literals documentation, which says:

The backslash (\) character is used to escape characters that otherwise have a special meaning, such as newline, backslash itself, or the quote character. String literals may optionally be prefixed with a letter 'r' or 'R'; such strings are called raw strings and use different rules for backslash escape sequences.

In your example string, \t is not two characters but a single character that represents the ASCII Horizontal Tab (TAB).

To tell the Python interpreter that these are two separate characters, you should use a raw string (put r before the opening quote of the string):

>>> list(r"    Hello \t World.")
[' ', ' ', ' ', ' ', 'H', 'e', 'l', 'l', 'o', ' ', '\\', 't', ' ', 'W', 'o', 'r', 'l', 'd', '.']

Here you'll also see \\ in the resulting list; that is just Python's way of displaying a single \ character.
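A small check of that claim, as a sketch (the variable names are only for illustration): the doubled backslash appears in the repr, while the string itself holds one character.

```python
tab = "\t"           # one character: a horizontal tab
backslash_t = r"\t"  # two characters: a backslash and the letter t

print(len(tab), len(backslash_t))

single_backslash = "\\"        # written with two backslashes in source code
print(len(single_backslash))   # length of the actual string
print(repr(single_backslash))  # the quoted, escaped display form
print(single_backslash)        # the character itself
```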

For the Python interpreter, '\' is an invalid string literal because \' inside a string represents a single quote ('). Hence, when you write '\', it raises the error below, because as far as Python is concerned the string has no closing quote:

>>> '\'
  File "<stdin>", line 1
    '\'
      ^
SyntaxError: EOL while scanning string literal

If you can't declare your string as a raw string (because it's already defined or imported from some other source), you may convert it to a byte string by encoding it with "unicode-escape":

>>> my_str = "    Hello \t World."

>>> unicode_escaped_string = my_str.encode('unicode-escape')
>>> unicode_escaped_string
b'    Hello \\t World.'

Since it is a byte string, iterating over it yields integers, so you need to call chr on each byte to get the corresponding character. For example:

>>> list(map(chr, unicode_escaped_string))
[' ', ' ', ' ', ' ', 'H', 'e', 'l', 'l', 'o', ' ', '\\', 't', ' ', 'W', 'o', 'r', 'l', 'd', '.']
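As a possible variation on the chr approach, the escaped bytes can be decoded back to a str first, since the unicode-escape output is plain ASCII. This is only a sketch; note that the codec also escapes non-ASCII characters and doubles any backslash already in the string, which may or may not be what you want.

```python
my_str = "    Hello \t World."

# Encode to the escaped byte form, then decode those ASCII bytes to a str.
escaped = my_str.encode("unicode-escape").decode("ascii")
chars = list(escaped)
print(chars)

# Side effects to be aware of:
print("é".encode("unicode-escape"))     # non-ASCII characters are escaped too
print(r"a\b".encode("unicode-escape"))  # an existing backslash gets doubled
```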
Moinuddin Quadri
4

You could convert the string to its Python literal form and then split it character by character:

string = "    Hello \t World."
string_raw = string.encode('unicode-escape')
print([ch for ch in string_raw])
print([chr(ch) for ch in string_raw])

Outputs:

[32, 32, 32, 32, 72, 101, 108, 108, 111, 32, 92, 116, 32, 87, 111, 114, 108, 100, 46]
[' ', ' ', ' ', ' ', 'H', 'e', 'l', 'l', 'o', ' ', '\\', 't', ' ', 'W', 'o', 'r', 'l', 'd', '.']

ASCII 92 is a single backslash, even though it is shown escaped (as '\\') when you print the list in a terminal.
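To see the difference between the list display and the actual characters, something like the following sketch may help. Printing a list shows the repr of each element, while printing the joined string shows the characters themselves.

```python
string = "    Hello \t World."
chars = [chr(b) for b in string.encode("unicode-escape")]

print(chars)           # list display: the backslash appears doubled
print("".join(chars))  # plain print: the backslash appears once

index = chars.index("\\")  # position of the backslash in the list
print(index, len(chars[index]), ord(chars[index]))
```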

Savir
-1

\t means tab. If you want to explicitly have a \ character, you'll need to escape it in your string:

string = "    Hello \\t World."

Or use a raw string:

string = r"    Hello \t World."
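Since the comments mention that the strings actually come from a source file, here is a minimal sketch of that case. Escape sequences are only interpreted in Python string literals, so a backslash followed by t that is read from a file should already arrive as two characters. The temporary file and the name source_line are stand-ins for the real input, which is not shown in the question.

```python
import os
import tempfile

# Hypothetical line of the file being read: a literal backslash and a t.
source_line = r"    Hello \t World."

# Write it to a temporary file to stand in for the real source file.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as handle:
    handle.write(source_line)
    path = handle.name

try:
    # Reading the file does not interpret escape sequences.
    with open(path, encoding="utf-8") as source_file:
        text = source_file.read()
finally:
    os.remove(path)

chars = list(text)
print(chars)
```

If that holds for your input, plain list(text) is enough and no conversion is needed.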
Axnyff