I am trying to get a list of the files and directories under a specified URL. The URL I'm using belongs to an online dictionary: www.shabdkosh.com/kn/browse/. My code is as follows:

import os

html_files = []

for root, dirs, files in os.walk("www.shabdkosh.com/kn/browse"):
    for file in files:
        # Files in shabdkosh have a digit as their name to represent the page number
        if file.isdigit():
            html_files.append(os.path.join(root, file))

When I print the contents of html_files, I get:

www.shabdkosh.com/kn/browse/3/1
www.shabdkosh.com/kn/browse/a/1
www.shabdkosh.com/kn/browse/a/10
www.shabdkosh.com/kn/browse/a/2
...

This works, but other URLs should also have been retrieved. URLs containing Kannada letters (Kannada is an Indian language) are not listed even though they exist.

For example,

www.shabdkosh.com/kn/browse/ಅ/

URLs like this are not listed even though they lie under the path "www.shabdkosh.com/kn/browse" passed as the argument to os.walk. So how do I get os.walk to return the URLs containing Kannada letters?

I even tried adding the following lines at the top of my Python file:

#!/usr/bin/env python
# -*- coding: ascii -*-

But no luck. Any help is appreciated.

P.S. Sorry if it bothers you that I'm using the old Python 2.7.

Gang
Ajay H
  • Isn't that exactly what `if file.isdigit()` asks for? Remove the condition and you will get more. – Gang Feb 11 '17 at 15:04
  • No luck. I even printed "files" outside the condition. I only get purely English URLs. – Ajay H Feb 11 '17 at 15:11

1 Answer

A couple of things to try:

  1. If you declare a source encoding at all, it should be utf-8, not ascii. Those are clearly not ASCII characters.
  2. Make sure the path you pass is unicode, e.g. os.walk(u"www.shabdkosh.com/kn/browse"). In Python 2, os.walk returns unicode names when given a unicode path and byte strings when given a byte string. See Ciro's comment on "Using os.walk() to recursively traverse directories in Python".
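The two suggestions above can be combined into a small sketch. It assumes the site has been mirrored to a local directory, since os.walk only traverses the local filesystem. The helper name collect_page_files and the temporary demo tree are hypothetical and exist only to illustrate the idea. The code is written to run on both Python 2.7 and Python 3, and on Python 2 it assumes the filesystem encoding is UTF-8.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

import os
import shutil
import tempfile


def collect_page_files(top):
    """Return the paths of files under `top` whose names are all digits."""
    # A unicode path makes Python 2's os.walk yield unicode names instead
    # of raw byte strings; Python 3 str paths are already unicode.
    if isinstance(top, bytes):
        top = top.decode("utf-8")
    found = []
    for root, dirs, files in os.walk(top):
        for name in files:
            if name.isdigit():
                found.append(os.path.join(root, name))
    return sorted(found)


# Hypothetical stand-in for the local mirror: a throwaway directory tree
# with one ASCII folder and one Kannada folder.
demo_root = tempfile.mkdtemp()
for letter, page in [("a", "1"), ("ಅ", "1"), ("ಅ", "2")]:
    folder = os.path.join(demo_root, "browse", letter)
    if not os.path.isdir(folder):
        os.makedirs(folder)
    open(os.path.join(folder, page), "w").close()

pages = collect_page_files(demo_root)
for path in pages:
    print(path)

# Clean up the throwaway tree.
shutil.rmtree(demo_root)
```

If the Kannada folders still do not show up with a unicode path, it may be worth checking how the mirror was created, since the directory names on disk could be percent-encoded rather than stored as Kannada characters.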
Community
kevin