urllib downloading contents of an online directory

Question

I'm trying to make a program that will open a directory, then use regular expressions to get the names of powerpoints and then create files locally and copy their content. When I run this it appears to work, however when I actually try to open the files they keep saying the version is wrong.

from urllib.request import urlopen
import re

urlpath = urlopen('http://www.divms.uiowa.edu/~jni/courses/ProgrammignInCobol/presentation/')
string = urlpath.read().decode('utf-8')

pattern = re.compile('ch[0-9]*.ppt') #the pattern actually creates duplicates in the list

filelist = pattern.findall(string)
print(filelist)

for filename in filelist:
    remotefile = urlopen('http://www.divms.uiowa.edu/~jni/courses/ProgrammignInCobol/presentation/' + filename)
    localfile = open(filename,'wb')
    localfile.write(remotefile.read())
    localfile.close()
    remotefile.close()

You should **never** parse HTML with RegEx, see http://stackoverflow.com/a/1732454/851737. Use a HTML parsing library like lxml or BeautifulSoup. — schlamar, Jun 04 '12 at 07:01

score 10 · Accepted Answer · answered Jun 04 '12 at 01:10

This code worked for me. I just modified it a little because yours was duplicating each ppt file.

from urllib2 import urlopen
import re

urlpath =urlopen('http://www.divms.uiowa.edu/~jni/courses/ProgrammignInCobol/presentation/')
string = urlpath.read().decode('utf-8')

pattern = re.compile('ch[0-9]*.ppt"') #the pattern actually creates duplicates in the list

filelist = pattern.findall(string)
print(filelist)

for filename in filelist:
    filename=filename[:-1]
    remotefile = urlopen('http://www.divms.uiowa.edu/~jni/courses/ProgrammignInCobol/presentation/' + filename)
    localfile = open(filename,'wb')
    localfile.write(remotefile.read())
    localfile.close()
    remotefile.close()

urllib downloading contents of an online directory

1 Answers1

Linked