Reading files with special characters and writing them to html

Question

I have a python script that reads the names of pdf files and writes them to an HTML file with links to the PDFs. All works well unless a name has special characters.

I have read many other answers on SE to no avail.

f = open("jobs/index.html", "w")
#html divs go here
for root, dirs, files in os.walk('jobs/'):
    files.sort()
    for name in files:
        if ((name!="index.html")&(name!=".htaccess")):
            f.write("<a href='"+name+"'>"+name.rstrip(".pdf")+"</a>\n<br><br>\n")
            print name.rstrip(".pdf")

Returns:
Caba�n-Sanchez, Jane.pdf
Smith, John.pdf

This, of course, breaks both the text and the link to that PDF.

How can I encode the file or the 'name' variable so that special characters are written correctly, i.e. Cabán-Sanchez, Jane.pdf?

olinox14 · Answer 1 · 2019-06-26T14:05:23.873

I'm not used to python 2.7, but this should work:

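import os
from io import open  # io.open accepts encoding= on Python 2.7

with open("jobs/index.html", "w", encoding='utf-8') as f:
    # Passing a unicode path makes os.walk return unicode names on Python 2
    for root, dirs, files in os.walk(u'jobs/'):
        files.sort()
        for name in files:
            if name not in ("index.html", ".htaccess"):
                label = os.path.splitext(name)[0]  # drop only the extension
                # io.open only accepts unicode text, hence the u"" prefix
                f.write(u"<a href='{}'>{}</a>\n<br><br>\n".format(name, label))
                print(label)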
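A side note on the `name.rstrip(".pdf")` call in the question's code: `rstrip` treats its argument as a set of characters and removes any trailing run of them, so it does not strip a suffix as such. A minimal sketch of the difference, assuming `os.path.splitext` is an acceptable replacement (`label_for` and the file names are made up for illustration):

```python
import os


def label_for(name):
    """Return the file name without its final extension."""
    return os.path.splitext(name)[0]


# Hypothetical file names, chosen to show the pitfall.
for name in ["Smith, John.pdf", "Stapp, Jeff.pdf"]:
    stripped = name.rstrip(".pdf")  # strips trailing '.', 'p', 'd', 'f' characters
    print("rstrip: %r   splitext: %r" % (stripped, label_for(name)))
```

For "Stapp, Jeff.pdf", `rstrip` also eats the trailing "ff" of the first name and yields "Stapp, Je", while `splitext` keeps "Stapp, Jeff".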

You should also declare the source file's encoding at the Python level by adding these lines at the top of your module:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

Finally, make sure the string passed to f.write is unicode, since io.open only accepts unicode text on Python 2. That means adding the u"" prefix to the literal on your f.write line, like:

f.write(u"...")
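The pieces of this answer (io.open with an explicit encoding, unicode strings throughout) can be sketched end to end as follows. This is a minimal illustration, not the asker's actual script: `write_index` is a hypothetical helper, the names are made up, and the output goes to a temporary directory rather than jobs/. It is written to run on both Python 2.7 and Python 3.

```python
import io
import os
import tempfile


def write_index(path, names):
    """Write one link per name to `path`, encoded as UTF-8."""
    with io.open(path, "w", encoding="utf-8") as f:
        for name in sorted(names):
            label = os.path.splitext(name)[0]
            # The u"" prefix keeps the whole string unicode on Python 2.
            f.write(u"<a href='{}'>{}</a>\n<br><br>\n".format(name, label))


# Hypothetical names standing in for what os.walk(u'jobs/') would return.
names = [u"Smith, John.pdf", u"Cab\u00e1n-Sanchez, Jane.pdf"]

out_dir = tempfile.mkdtemp()
index_path = os.path.join(out_dir, "index.html")
write_index(index_path, names)

# Read the file back as raw bytes to see what actually landed on disk.
with io.open(index_path, "rb") as f:
    data = f.read()
```

In the bytes read back, á appears as the two-byte UTF-8 sequence 0xc3 0xa1, which is what a browser expects when the page is served or declared as UTF-8.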

Why io.open: Backporting Python 3 open(encoding="utf-8") to Python 2

Why you should use the with keyword when you can: https://www.pythonforbeginners.com/files/with-statement-in-python

edited Jun 26 '19 at 14:05

answered Jun 25 '19 at 15:22


TypeError: 'encoding' is an invalid keyword argument for this function. I don't think encoding is supported in 2.7 – CCantey Jun 25 '19 at 18:36
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe1 in position 3: ordinal not in range(128) on the f.write line... I'm kind of going around in circles between a few SO solutions and errors... – CCantey Jun 26 '19 at 13:47
I guess you cannot switch to Python 3? Encoding is a plague with Python 2.x... However, see my upcoming update – olinox14 Jun 26 '19 at 14:02
Unfortunately no, it's packaged with GIS, which depends on 2.7. Using utf-8 forces me to use u'..'. I have also declared utf-8 at the top... Thanks – CCantey Jun 26 '19 at 15:54
If you could switch to QGis3+ you would also get Python 3... But anyway, did it work for you? Is your problem solved? If it is, could you consider marking the answer as accepted? – olinox14 Jun 27 '19 at 07:45
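The UnicodeDecodeError quoted in the comments above names byte 0xe1, which is á in Latin-1 and Windows-1252; in UTF-8 the same character is the two bytes 0xc3 0xa1. That suggests, though the thread does not confirm it, that the file names reach the script as byte strings in a Windows code page, and that Python 2 then tries to decode them implicitly as ASCII. A sketch of decoding such a name explicitly before writing it, where `decode_name` and its fallback order are hypothetical choices:

```python
def decode_name(raw, encodings=("utf-8", "cp1252")):
    """Return `raw` as unicode text, trying each encoding in turn."""
    if not isinstance(raw, bytes):
        return raw  # already unicode text
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Latin-1 maps every byte to a code point, so this never fails.
    return raw.decode("latin-1")


# Bytes consistent with the error message (an assumption, not taken from the asker's disk).
raw_name = b"Cab\xe1n-Sanchez, Jane.pdf"

ascii_error = None
try:
    raw_name.decode("ascii")  # what Python 2 attempts implicitly when mixing str and unicode
except UnicodeDecodeError as exc:
    ascii_error = str(exc)

name = decode_name(raw_name)
utf8_bytes = name.encode("utf-8")
```

Here the UTF-8 attempt fails because 0xe1 is not followed by continuation bytes, so the name is decoded as cp1252. If the real names use a different code page, the `encodings` tuple would need to change accordingly.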

Andra · Answer 2 · 2019-06-25T15:35:53.773

You're trying to write a Unicode character (á in this case) to an HTML file, so you should specify the charset in an HTML meta tag:

<meta charset="UTF-8">
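Note that the meta tag only tells the browser how to interpret the bytes; the file still has to be written in that same encoding. A minimal sketch of a page whose declared charset matches its bytes, where `build_page` and the surrounding markup are illustrative assumptions rather than the asker's actual template:

```python
def build_page(link_lines):
    """Wrap already-rendered link lines in a page that declares UTF-8."""
    head = (
        u"<!DOCTYPE html>\n<html>\n<head>\n"
        u"<meta charset=\"UTF-8\">\n"
        u"</head>\n<body>\n"
    )
    tail = u"</body>\n</html>\n"
    return head + u"".join(link_lines) + tail


# One hypothetical link line, in the format used by the question's script.
links = [u"<a href='Cab\u00e1n-Sanchez, Jane.pdf'>Cab\u00e1n-Sanchez, Jane</a>\n<br><br>\n"]

page = build_page(links)
page_bytes = page.encode("utf-8")  # the bytes that must end up on disk
```

If the same text were written in a single-byte encoding such as cp1252 instead, a browser honoring the UTF-8 declaration would typically show a replacement character where the á should be.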

The rest of it works fine on my machine, though:

andraantariksa@LaptopnyaAndra:~$ cd Desktop/
andraantariksa@LaptopnyaAndra:~/Desktop$ mkdir jobs
andraantariksa@LaptopnyaAndra:~/Desktop$ cd jobs/
andraantariksa@LaptopnyaAndra:~/Desktop/jobs$ touch "Cabán-Sanchez, Jane.pdf"
andraantariksa@LaptopnyaAndra:~/Desktop/jobs$ ls
'Cabán-Sanchez, Jane.pdf'
andraantariksa@LaptopnyaAndra:~/Desktop/jobs$ cd ../
andraantariksa@LaptopnyaAndra:~/Desktop$ python
Python 2.7.15+ (default, Nov 27 2018, 23:36:35) 
[GCC 7.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> f = open("jobs/index.html", "w")
>>> #html divs go here
... for root, dirs, files in os.walk('jobs/'):
...     files.sort()
...     for name in files:
...         if ((name!="index.html")&(name!=".htaccess")):
...             f.write("<a href='"+name+"'>"+name.rstrip(".pdf")+"</a>\n<br><br>\n")
...             print name.rstrip(".pdf")
... 
Cabán-Sanchez, Jane
andraantariksa@LaptopnyaAndra:~/Desktop$ cat jobs/index.html 
<a href='Cabán-Sanchez, Jane.pdf'>Cabán-Sanchez, Jane</a>
<br><br>

This all looks good to me, I'm not sure why it's not working on my end... I have at the top of my html... – CCantey Jun 25 '19 at 15:52
