URL Encoding Questions for Results of Loading File Names with 'os.scandir' (or os.listdir)

Question

I posted a question about this earlier. I have since found what I believe is the cause of the problem, so I'm asking again, this time about the solution.

The cause is none other than os.scandir: when you URL-encode a file name (file.name) loaded with os.scandir, the result differs from URL-encoding the same string typed in directly.

For example, suppose a file loaded with the following code is named '서울용마초등학교':

import os
from urllib import parse

files = os.scandir('paths to specific folders')
for file in files:
    # Slice off the prefix and keep the part before '~', then print it raw and URL-encoded
    print(file.name[15:].split('~')[0])
    print(parse.quote(file.name[15:].split('~')[0]))

The encoding result for that is as follows.

%E1%84%89%E1%85%A5%E1%84%8B%E1%85%AE%E1%86%AF%E1%84%8B%E1%85%AD%E1%86%BC%E1%84%86%E1%85%A1%E1%84%8E%E1%85%A9%E1%84%83%E1%85%B3%E1%86%BC%E1%84%92%E1%85%A1%E1%86%A8%E1%84%80%E1%85%AD

However, encoding the string '서울용마초등학교' typed in directly gives the following.

%EC%84%9C%EC%9A%B8%EC%9A%A9%EB%A7%88%EC%B4%88%EB%93%B1%ED%95%99%EA%B5%90

What's even stranger is that, when decoded, both of the above results come out as '서울용마초등학교'.

If I send a request with the first encoding result as a parameter, an error occurs. If I send it with the encoding result of the typed '서울용마초등학교' string (the second result), it works normally.

How can I make the first result come out the same as the second? Please reply.

score 0 · Accepted Answer · answered Apr 14 '23 at 14:15

It's Unicode normalization.
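
A minimal sketch of what that means for the code in the question, assuming the receiving server expects the precomposed (NFC) form: normalize the name before passing it to parse.quote. The names quote_nfc, typed and from_disk are hypothetical, and from_disk only simulates a decomposed (NFD) file name as returned by os.scandir on some file systems.

```python
import unicodedata
from urllib import parse

def quote_nfc(name):
    # Compose decomposed Hangul jamo into precomposed syllables (NFC),
    # then percent-encode the UTF-8 bytes.
    return parse.quote(unicodedata.normalize('NFC', name))

# The typed string from the question, written as escapes (already NFC).
typed = '\uc11c\uc6b8\uc6a9\ub9c8\ucd08\ub4f1\ud559\uad50'

# Assumption: stand-in for a file name that comes back decomposed (NFD).
from_disk = unicodedata.normalize('NFD', typed)

print(parse.quote(from_disk) == parse.quote(typed))  # False
print(quote_nfc(from_disk) == parse.quote(typed))    # True
```

In the question's loop this would amount to normalizing file.name[15:].split('~')[0] before quoting it.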

For demonstration, I used a script from a former answer of mine (omitting all the parse.quote and parse.unquote junk).

encodeuni.py 서울용마초등학교

raw     서울용마초등학교        8       \uc11c\uc6b8\uc6a9\ub9c8\ucd08\ub4f1\ud559\uad50

NFC     서울용마초등학교        8       \uc11c\uc6b8\uc6a9\ub9c8\ucd08\ub4f1\ud559\uad50
NFKC    서울용마초등학교        8       \uc11c\uc6b8\uc6a9\ub9c8\ucd08\ub4f1\ud559\uad50
NFD     서울용마초등학교    20      \u1109\u1165\u110b\u116e\u11af\u110b\u116d\u11bc\u1106\u1161\u110e\u1169\u1103\u1173\u11bc\u1112\u1161\u11a8\u1100\u116d
NFKD    서울용마초등학교    20      \u1109\u1165\u110b\u116e\u11af\u110b\u116d\u11bc\u1106\u1161\u110e\u1169\u1103\u1173\u11bc\u1112\u1161\u11a8\u1100\u116d
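
The same difference can be checked directly on the two percent-encoded strings from the question. This is only an illustrative sketch; the variable names are hypothetical. Both decode to text that renders identically, but the strings differ in length and compare unequal until they are normalized to the same form.

```python
import unicodedata
from urllib import parse

# First result from the question (file name obtained via os.scandir).
nfd_quoted = ('%E1%84%89%E1%85%A5%E1%84%8B%E1%85%AE%E1%86%AF%E1%84%8B%E1%85%AD'
              '%E1%86%BC%E1%84%86%E1%85%A1%E1%84%8E%E1%85%A9%E1%84%83%E1%85%B3'
              '%E1%86%BC%E1%84%92%E1%85%A1%E1%86%A8%E1%84%80%E1%85%AD')
# Second result from the question (string typed in directly).
nfc_quoted = '%EC%84%9C%EC%9A%B8%EC%9A%A9%EB%A7%88%EC%B4%88%EB%93%B1%ED%95%99%EA%B5%90'

a = parse.unquote(nfd_quoted)
b = parse.unquote(nfc_quoted)

print(len(a), len(b))                        # 20 8
print(a == b)                                # False
print(unicodedata.normalize('NFC', a) == b)  # True
```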

The characters in detail. Note that the CodePoint column contains both the Unicode code point (U+hhhh) and the UTF-8 bytes.

NFC, NFKC:

Char CodePoint                   Category Description
---- ---------                   -------- -----------
   서 {U+C11C, 0xEC,0x84,0x9C} OtherLetter Hangul Syllable Sios Eo
   울 {U+C6B8, 0xEC,0x9A,0xB8} OtherLetter Hangul Syllable Ieung U Rieul
   용 {U+C6A9, 0xEC,0x9A,0xA9} OtherLetter Hangul Syllable Ieung Yo Ieung
   마 {U+B9C8, 0xEB,0xA7,0x88} OtherLetter Hangul Syllable Mieum A
   초 {U+CD08, 0xEC,0xB4,0x88} OtherLetter Hangul Syllable Chieuch O
   등 {U+B4F1, 0xEB,0x93,0xB1} OtherLetter Hangul Syllable Tikeut Eu Ieung
   학 {U+D559, 0xED,0x95,0x99} OtherLetter Hangul Syllable Hieuh A Kiyeok
   교 {U+AD50, 0xEA,0xB5,0x90} OtherLetter Hangul Syllable Kiyeok Yo

NFD, NFKD:

Char CodePoint                   Category Description
---- ---------                   -------- -----------
   ᄉ {U+1109, 0xE1,0x84,0x89} OtherLetter Hangul Choseong Sios
   ᅥ {U+1165, 0xE1,0x85,0xA5} OtherLetter Hangul Jungseong Eo
   ᄋ {U+110B, 0xE1,0x84,0x8B} OtherLetter Hangul Choseong Ieung
   ᅮ {U+116E, 0xE1,0x85,0xAE} OtherLetter Hangul Jungseong U
   ᆯ {U+11AF, 0xE1,0x86,0xAF} OtherLetter Hangul Jongseong Rieul
   ᄋ {U+110B, 0xE1,0x84,0x8B} OtherLetter Hangul Choseong Ieung
   ᅭ {U+116D, 0xE1,0x85,0xAD} OtherLetter Hangul Jungseong Yo
   ᆼ {U+11BC, 0xE1,0x86,0xBC} OtherLetter Hangul Jongseong Ieung
   ᄆ {U+1106, 0xE1,0x84,0x86} OtherLetter Hangul Choseong Mieum
   ᅡ {U+1161, 0xE1,0x85,0xA1} OtherLetter Hangul Jungseong A
   ᄎ {U+110E, 0xE1,0x84,0x8E} OtherLetter Hangul Choseong Chieuch
   ᅩ {U+1169, 0xE1,0x85,0xA9} OtherLetter Hangul Jungseong O
   ᄃ {U+1103, 0xE1,0x84,0x83} OtherLetter Hangul Choseong Tikeut
   ᅳ {U+1173, 0xE1,0x85,0xB3} OtherLetter Hangul Jungseong Eu
   ᆼ {U+11BC, 0xE1,0x86,0xBC} OtherLetter Hangul Jongseong Ieung
   ᄒ {U+1112, 0xE1,0x84,0x92} OtherLetter Hangul Choseong Hieuh
   ᅡ {U+1161, 0xE1,0x85,0xA1} OtherLetter Hangul Jungseong A
   ᆨ {U+11A8, 0xE1,0x86,0xA8} OtherLetter Hangul Jongseong Kiyeok
   ᄀ {U+1100, 0xE1,0x84,0x80} OtherLetter Hangul Choseong Kiyeok
   ᅭ {U+116D, 0xE1,0x85,0xAD} OtherLetter Hangul Jungseong Yo

The script:

import sys
from unicodedata import normalize

def encodeuni(s):
    '''
    Returns input string encoded to escape sequences as in a string literal.
    Output is similar to
      str(s.encode('unicode_escape')).lstrip('b').strip("'").replace('\\\\','\\');
    but even every ASCII character is encoded as a \\xNN escape sequence
    (except a space character). For instance: 
    
    s = 'A á ř 🌈';
    encodeuni(s);       # '\\x41 \\xe1 \\u0159 \\U0001f308'     while 
    str(s.encode('unicode_escape')).lstrip('b').strip("'").replace('\\\\','\\');
    #                   #    'A \\xe1 \\u0159 \\U0001f308'
    '''
    def encodechar(ch):
        ordch = ord(ch)
        return ( ch                if ordch == 0x20   else 
                 f"\\x{ordch:02x}" if ordch <= 0xFF   else
                 f"\\u{ordch:04x}" if ordch <= 0xFFFF else
                 f"\\U{ordch:08x}" )
                 
    return ''.join([encodechar(ch) for ch in s]) 

if len(sys.argv) >= 2 and sys.argv[1] != '':
    letters = (' '.join(
    [sys.argv[i] for i in range(1,len(sys.argv))])).strip()
    # .\SO\59979037.py  ÅÅÅ
else:
    letters = '\u212B \u00C5 \u0041\u030A \U0001f308'
    #          \u212B                     Å Angstrom Sign
    #                 \u00C5              Å Latin Capital Letter A With Ring Above
    #                        \u0041       A Latin Capital Letter A
    #                              \u030A ̊  Combining Ring Above
    #                                     \U0001f308  Rainbow

print('\t'.join( ['raw' ,
                  letters.ljust(10),
                  str(len(letters)),
                  encodeuni(letters),'\n']))
for form in ['NFC','NFKC','NFD','NFKD']:
    letnorm = normalize(form, letters)
    print( '\t'.join( [form,
                      letnorm.ljust(10),
                      str(len(letnorm)),
                      encodeuni(letnorm)]))

I followed your advice and it has been successfully improved. Thank you. :) !! — TAMDAO diptyque, Apr 15 '23 at 00:27
@TAMDAOdiptyque My pleasure. Please consider [accepting the answer](https://meta.stackoverflow.com/a/5235) if you find that it solved your problem. — JosefZ, Apr 15 '23 at 17:42
