I have a Python script that reads the names of PDF files and writes them to an HTML file with links to the PDFs. Everything works well unless a name contains special characters.

I have read many other answers on SE to no avail.

import os

f = open("jobs/index.html", "w")
#html divs go here
for root, dirs, files in os.walk('jobs/'):
    files.sort()
    for name in files:
        if name != "index.html" and name != ".htaccess":
            # splitext drops only the extension; rstrip(".pdf") would also eat trailing '.', 'p', 'd', 'f' characters
            title = os.path.splitext(name)[0]
            f.write("<a href='"+name+"'>"+title+"</a>\n<br><br>\n")
            print title
f.close()

Returns:
Caba�n-Sanchez, Jane.pdf
Smith, John.pdf

Which of course breaks both the text and the link to that PDF.

How can I encode the file, or the 'name' variable, so that special characters are written correctly?
For example: Cabán-Sanchez, Jane.pdf

CCantey

2 Answers

I'm not used to python 2.7, but this should work:

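import os
from io import open  # io.open accepts encoding= on Python 2.7

# On Python 2, a file opened in text mode with io.open only accepts unicode strings
with open("jobs/index.html", "w", encoding='utf-8') as f:
    for root, dirs, files in os.walk('jobs/'):
        files.sort()
        for name in files:
            if name not in ("index.html", ".htaccess"):
                # splitext removes only the extension, unlike rstrip(".pdf")
                title = os.path.splitext(name)[0]
                f.write("<a href='{}'>{}</a>\n<br><br>\n".format(name, title))
                print title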

You should also declare the source encoding at the Python level, by adding these lines at the top of your module:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

And finally, you may try to explicitly declare your string as unicode by adding a u"" prefix to the string in your f.write line, like:

f.write(u"...")
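
The suggestions above can be combined into one sketch. It is not a tested fix for the asker's setup. It assumes the file names can be decoded with the filesystem encoding, and the helper names link_line and write_index are made up for illustration. The key point is that on Python 2, passing a unicode path to os.walk makes it return unicode file names, so nothing has to be implicitly decoded as ASCII when the names are mixed with u"" strings and written through io.open.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals  # plain string literals become unicode on Python 2

import io
import os
import sys

SKIP = ("index.html", ".htaccess")


def link_line(name):
    """Return the HTML link line for one unicode file name (hypothetical helper)."""
    title = os.path.splitext(name)[0]  # drop only the extension
    return "<a href='{}'>{}</a>\n<br><br>\n".format(name, title)


def write_index(directory):
    """Write directory/index.html as UTF-8 and return the list of link titles."""
    # Assumption: a byte-string path can be decoded with the filesystem encoding.
    if isinstance(directory, bytes):
        directory = directory.decode(sys.getfilesystemencoding() or "utf-8")
    titles = []
    index_path = os.path.join(directory, "index.html")
    with io.open(index_path, "w", encoding="utf-8") as f:
        # A unicode path makes os.walk yield unicode names on Python 2.
        for root, dirs, files in os.walk(directory):
            for name in sorted(files):
                if name in SKIP:
                    continue
                f.write(link_line(name))
                titles.append(os.path.splitext(name)[0])
    return titles
```

For the directory in the question this would be called as write_index(u"jobs/"). The titles are returned rather than printed, because printing unicode on Python 2 can itself fail when the console encoding cannot represent the characters.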
olinox14
  • TypeError: 'encoding' is an invalid keyword argument for this function. I don't think encoding is supported in 2.7 – CCantey Jun 25 '19 at 18:36
  • UnicodeDecodeError: 'ascii' codec can't decode byte 0xe1 in position 3: ordinal not in range(128) on the f.write line... I'm kind of going around in circles between a few SO solutions and errors... – CCantey Jun 26 '19 at 13:47
  • I guess you can not switch to python3? encoding is a plague with python2.x... However, see my upcoming update – olinox14 Jun 26 '19 at 14:02
  • Unfortunately no, it's packaged with GIS, which depends on 2.7. Using utf-8 forces me to use u'..'. I have also declared utf-8 at the top... Thanks – CCantey Jun 26 '19 at 15:54
  • If you could switch to QGIS 3+ you would also get Python 3... But anyway, did it work for you? Is your problem solved? If it is, could you consider marking the answer as accepted? – olinox14 Jun 27 '19 at 07:45

You're trying to write a Unicode character (á in this case) to an HTML file, so you should specify the charset in an HTML meta tag.

<meta charset="UTF-8">
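
As a rough illustration of where that tag goes, here is a sketch that builds the whole page in memory and returns it as UTF-8 bytes, so the declared charset matches the bytes actually written. The function name build_page is made up for this example. Percent-encoding the href and escaping the link text are extra precautions assumed here, not something the original script does.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2.7

from xml.sax.saxutils import escape


def build_page(names):
    """Return a complete HTML page, as UTF-8 bytes, linking to each unicode name."""
    lines = [
        "<!DOCTYPE html>",
        "<html>",
        "<head>",
        '<meta charset="UTF-8">',  # must match the encoding used below
        "</head>",
        "<body>",
    ]
    for name in names:
        # Percent-encode the UTF-8 bytes of the name for use in the URL.
        href = quote(name.encode("utf-8"))
        # Escape &, < and > in the visible link text.
        label = escape(name)
        lines.append("<a href='{}'>{}</a>\n<br><br>".format(href, label))
    lines += ["</body>", "</html>"]
    return "\n".join(lines).encode("utf-8")
```

The returned bytes can be written with open("jobs/index.html", "wb"), which sidesteps any implicit encoding by the file object.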

The rest of it works fine on my machine, though:

andraantariksa@LaptopnyaAndra:~$ cd Desktop/
andraantariksa@LaptopnyaAndra:~/Desktop$ mkdir jobs
andraantariksa@LaptopnyaAndra:~/Desktop$ cd jobs/
andraantariksa@LaptopnyaAndra:~/Desktop/jobs$ touch "Cabán-Sanchez, Jane.pdf"
andraantariksa@LaptopnyaAndra:~/Desktop/jobs$ ls
'Cabán-Sanchez, Jane.pdf'
andraantariksa@LaptopnyaAndra:~/Desktop/jobs$ cd ../
andraantariksa@LaptopnyaAndra:~/Desktop$ python
Python 2.7.15+ (default, Nov 27 2018, 23:36:35) 
[GCC 7.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> f = open("jobs/index.html", "w")
>>> #html divs go here
... for root, dirs, files in os.walk('jobs/'):
...     files.sort()
...     for name in files:
...         if ((name!="index.html")&(name!=".htaccess")):
...             f.write("<a href='"+name+"'>"+name.rstrip(".pdf")+"</a>\n<br><br>\n")
...             print name.rstrip(".pdf")
... 
Cabán-Sanchez, Jane
andraantariksa@LaptopnyaAndra:~/Desktop$ cat jobs/index.html 
<a href='Cabán-Sanchez, Jane.pdf'>Cabán-Sanchez, Jane</a>
<br><br>
Andra
  • This all looks good to me, I'm not sure why it's not working on my end... I have at the top of my html... – CCantey Jun 25 '19 at 15:52