
I'm trying to convert SPSS syntax files to readable HTML. It works almost perfectly, except that a single non-printable character is inserted into the HTML file. It doesn't seem to have an ASCII code and looks like a tiny dot, and it's causing trouble.

It occurs only in the second line of the HTML file, which always corresponds to the first line of the original file. That probably hints at which line(s) of Python cause the problem (please see the comments in the code).

The code which seems to cause this is:

    rfil = open(fil,"r") #rfil =  Read File, original syntax
    wfil = open(txtFil,"w") #wfil =  Write File, HTML output
    #Line below causes problem??
    wfil.write("<ol class='code'>\n<li>") 
    cnt = 0
    for line in rfil:
        if cnt == 0:
            #Line below causes problem??
            wfil.write(line.rstrip("\n").replace("'",'&#39;').replace('"','&#34;')) 
        elif len(line) > 1:
            wfil.write("</li>\n<li>" + line.strip("\n").replace("'",'&#39;').replace('"','&#34;'))
        else:
            wfil.write("<br /><br />")
        cnt += 1
    wfil.write("</li>\n</ol>")
    wfil.close()
    rfil.close()

[Screenshot of the result]

9000
RubenGeert
  • What does "causing trouble" mean in this case? I am a utf-8 fundamentalist. When you read into python, try to convert it into utf-8 or unicode first. When you write out, always use utf-8. But I don't actually know if that advice addresses your problem. – Adrian Ratnapala May 14 '13 at 08:59
  • You can strip unprintable characters from a string using: `import string; "".join(s for s in foo if s in string.printable)` [More information here](http://stackoverflow.com/a/16402009/1076493) – timss May 14 '13 at 09:03
  • @AdrianRatnapala: "Causing trouble" means that the non printable character is probably inserted by that line of Python code. When I view the final HTML page in the browser, it shows up really weird and that's what I'm trying to fix. – RubenGeert May 14 '13 at 09:03
  • Try `print repr(line)` to see the code of the character. – Janne Karila May 14 '13 at 09:04
  • If I don't edit the .sps input file in any way but just read each line and write it to a new .html output file, all goes well. Which makes me think that one of the Python string manipulations is incorrect and causing the problem. – RubenGeert May 14 '13 at 09:06
  • @RubenGeert Which is why you should try to `repr()` the text you're generating, or strip'ing it using `string.printable`. – timss May 14 '13 at 09:07
  • Sorry, I didn't know `repr(line)`. It produces something like `\xef\xbb\xbf` where the problem character shows up. – RubenGeert May 14 '13 at 09:10

2 Answers


The input file seems to begin with a byte order mark (BOM), to indicate UTF-8 encoding. You can decode the file to Unicode strings by opening it with

    import codecs
    rfil = codecs.open(fil, "r", "utf_8_sig")

The utf_8_sig encoding skips the BOM at the beginning of the file, if one is present.
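As a small check of that claim (a sketch, not part of the original answer): the three bytes reported by `repr(line)` are the UTF-8 encoding of U+FEFF. Decoding them with `utf_8` keeps that character, while `utf_8_sig` drops it. The sample bytes in `raw` are made up for illustration.

```python
import codecs

# Hypothetical first line of an .sps file that was saved with a UTF-8 BOM.
raw = codecs.BOM_UTF8 + b"GET FILE='data.sav'.\n"

plain = raw.decode("utf_8")      # the BOM survives as the character U+FEFF
clean = raw.decode("utf_8_sig")  # the BOM is skipped while decoding

print(repr(plain[0]))            # the stray "tiny dot" character
print(clean == plain[1:])        # the two results differ only by that character
```
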

Some programs recognize the BOM, some don't. To write the file out without BOM, use

    wfil = codecs.open(txtFil, "w", "utf_8")
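Putting the two encodings together with the loop from the question might look roughly like the sketch below. It is only an illustration under some assumptions: `sps_to_html` and `escape_quotes` are hypothetical names, and `io.open` is used in place of `codecs.open` because it accepts an `encoding` argument on both Python 2 and Python 3.

```python
import io


def escape_quotes(text):
    # Same two replacements as in the question's loop.
    return text.replace("'", "&#39;").replace('"', "&#34;")


def sps_to_html(src_path, dst_path):
    """Hypothetical helper: wrap the lines of src_path in an <ol> list."""
    # utf_8_sig drops a leading BOM if there is one; utf_8 writes none.
    with io.open(src_path, "r", encoding="utf_8_sig") as rfil, \
         io.open(dst_path, "w", encoding="utf_8") as wfil:
        wfil.write(u"<ol class='code'>\n<li>")
        for cnt, line in enumerate(rfil):
            body = escape_quotes(line.rstrip("\n"))
            if cnt == 0:
                wfil.write(body)
            elif len(line) > 1:
                wfil.write(u"</li>\n<li>" + body)
            else:
                # A line holding only a newline becomes a visual gap.
                wfil.write(u"<br /><br />")
        wfil.write(u"</li>\n</ol>")
```

Calling `sps_to_html(fil, txtFil)` with the question's two paths would then replace the manual `open`/`close` pairs; the `with` block closes both files even if an error occurs.
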
Janne Karila

What you see is a byte-order mark (BOM). The bytes you see, `\xef\xbb\xbf`, show that the strings you are working with are actually UTF-8 encoded; you can decode them into proper Unicode (`line.decode('utf-8')`) to make manipulation easier.

Then you can augment the logic for the first line so that it safely removes the BOM:

    for raw_line in rfil:
        line = raw_line.decode('utf-8')  # now line is Unicode
        # Note the u prefix: in a Python 2 byte string, '\ufeff' is not an escape.
        if cnt == 0 and line.startswith(u'\ufeff'):
            line = line[1:]  # cut the first character, which is a BOM
        ...
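A self-contained version of the same idea, as a minimal sketch: `strip_bom` is a hypothetical helper name and `raw_lines` is made-up input standing in for the lines read from the file as bytes.

```python
def strip_bom(text):
    """Hypothetical helper: drop one leading U+FEFF from a Unicode string."""
    if text.startswith(u"\ufeff"):
        return text[1:]
    return text


# Made-up input: two lines as raw UTF-8 bytes, the first one carrying a BOM.
raw_lines = [b"\xef\xbb\xbfGET FILE='a.sav'.\n", b"EXECUTE.\n"]

lines = [raw.decode("utf-8") for raw in raw_lines]
lines[0] = strip_bom(lines[0])  # only the first line can carry the BOM
```

Using `startswith` rather than indexing `line[0]` also keeps the check safe for an empty string.
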
9000