
Here is the snippet:

for eachLine in content.splitlines(True):
    entity = str(eachLine.encode("utf-8"))[1:]
    splitResa = entity.split('\t')
    print(entity)
    print(splitResa)

Basically I am getting this result:

'<!ENTITY DOCUMENT_STATUS\t\t\t\t\t"draft">\n'
['\'<!ENTITY DOCUMENT_STATUS\\t\\t\\t\\t\\t"draft">\\n\'']

however in IDLE it all works fine:

>>> '<!ENTITY DOCUMENT_STATUS\t\t\t\t\t"draft">\n'.split('\t')
['<!ENTITY DOCUMENT_STATUS', '', '', '', '', '"draft">\n']

I couldn't figure out why. I've also tried the answers in "splitting a string based on tab in the file", but I still get the same behaviour. What is the issue here?

Sarp Kaya
  • @PadraicCunningham it's `` – Sarp Kaya Mar 23 '15 at 09:34
  • Why are you encoding *in the first place*? And then removing the `b` from the `bytes` representation (debugging output!) but leaving in the single or double quotes? What is the problem you are trying to solve here? – Martijn Pieters Mar 23 '15 at 09:37
  • Moreover, you appear to be processing a XML DTD. Why not use a XML parser for the task? – Martijn Pieters Mar 23 '15 at 09:38
  • @SarpKaya. I meant where is it coming from, I don't understand why you are encoding – Padraic Cunningham Mar 23 '15 at 09:39
  • @MartijnPieters if I don't encode then I get UnicodeEncodeError: 'charmap' codec can't encode characters in position 141-142 – Sarp Kaya Mar 23 '15 at 09:41
  • @MartijnPieters it's not XML unfortunately, it's something else that just happens to use < and > stuff for formatting, where key and values are only paired up with a tab – Sarp Kaya Mar 23 '15 at 09:42
  • @PadraicCunningham I am reading a file that's what content is (file read) – Sarp Kaya Mar 23 '15 at 09:43
  • can you add some of your file input? – Padraic Cunningham Mar 23 '15 at 09:45
  • @SarpKaya: this is not the way to solve that. If you are getting encoding errors, then perhaps you need to deal with that *at the source of that problem*. – Martijn Pieters Mar 23 '15 at 09:49
  • @SarpKaya: for example, if you are printing, then that indicates that your console or terminal cannot handle those unicode points, not that your code is wrong. Reconfigure the console or terminal. If you are getting this when writing to a file, you used the default codec for files and should change that to one that can handle your code points, etc. – Martijn Pieters Mar 23 '15 at 09:50
  • @SarpKaya: what you did here is severely break your string. – Martijn Pieters Mar 23 '15 at 09:50
  • @SarpKaya: the only reason you are no longer getting those encoding errors is because all UTF-8 bytes outside of the ASCII range are going to be represented with *4 characters per byte*, a `\` backslash, the character `x`, and two hex characters. Unless you want to do a lot of work later on interpreting those again that is not something you want, nor is it efficient. – Martijn Pieters Mar 23 '15 at 09:53

2 Answers


Looks like the string you are splitting contains literal backslash and t characters, just as a raw string literal would.

>>> r'<!ENTITY DOCUMENT_STATUS\t\t\t\t\t"draft">\n'.split('\t')
['<!ENTITY DOCUMENT_STATUS\\t\\t\\t\\t\\t"draft">\\n']

So, you should either split that with a raw \t (r'\t'), like this

>>> r'<!ENTITY DOCUMENT_STATUS\t\t\t\t\t"draft">\n'.split(r'\t')
['<!ENTITY DOCUMENT_STATUS', '', '', '', '', '"draft">\\n']

or with properly escaped \t ('\\t'), like this

>>> r'<!ENTITY DOCUMENT_STATUS\t\t\t\t\t"draft">\n'.split('\\t')
['<!ENTITY DOCUMENT_STATUS', '', '', '', '', '"draft">\\n']
thefourtheye
  • They shouldn't be using string representations of a `bytes` object in the first place. Any UTF-8 bytes are also going to be mangled. – Martijn Pieters Mar 23 '15 at 09:52
  • Thanks for the answer. How do I convert a raw string to a normal string so that I can avoid using `r` completely? – Sarp Kaya Mar 23 '15 at 14:25
  • @SarpKaya What do you mean by that? Raw strings are normal strings only. If you want to avoid `r`, follow the second method I mentioned in the answer `\\t`. – thefourtheye Mar 24 '15 at 03:00

You produced the string representation of a bytes object and then sliced that repr() debugging output. In that representation, any non-printable or special character is replaced by its escape sequence. The output you produced has no tab characters in it; it contains sequences of the two characters \ and t:

>>> '\t'
'\t'
>>> '\t'.encode('utf8')
b'\t'
>>> str('\t'.encode('utf8'))
"b'\\t'"
>>> str('\t'.encode('utf8'))[1:]
"'\\t'"
>>> str('\t'.encode('utf8'))[1:][1:-1]
'\\t'
>>> len(str('\t'.encode('utf8'))[1:][1:-1])
2

It is not clear to me why you are encoding the text to bytes and then converting back to a string in the first place. Generally speaking, you don't want to do that.

In IDLE, you did not produce such mangled output; you just have a regular string with actual tabs, so splitting on those works. My only advice is to not encode to bytes here.
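As a minimal sketch of that advice, assuming Python 3 and a made-up sample line (the real file contents are not shown in the question), you can split the text directly and only fall back to ascii() when printing, if your console cannot display some characters. The names line, fields, key, value and mangled are illustrative only:

```python
# Hypothetical sample line standing in for one line of the file.
line = '<!ENTITY DOCUMENT_STATUS\t\t\t\t\t"draft">\n'

# Split the text itself: it still contains real tab characters.
fields = line.split('\t')
print(fields)

# Drop the empty strings produced by consecutive tabs.
key, value = [f for f in fields if f]

# ascii() escapes non-ASCII characters for display only; key and value
# themselves are left untouched.
print(ascii(key), ascii(value))

# For comparison: str() of a bytes object is its repr, which holds the
# two characters backslash and t instead of a tab.
mangled = str(line.encode('utf-8'))[1:]
print('\t' in mangled)
```

Printing through ascii() only changes what is displayed, so the data you split and store keeps its real tabs and any non-ASCII characters.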

Martijn Pieters