I'm trying to write a program that opens a directory listing, uses a regular expression to find the names of the PowerPoint files, then creates local files and copies their contents into them. When I run it, it appears to work, but when I try to open the downloaded files I keep getting an error saying the version is wrong.

from urllib.request import urlopen
import re

urlpath = urlopen('http://www.divms.uiowa.edu/~jni/courses/ProgrammignInCobol/presentation/')
string = urlpath.read().decode('utf-8')

pattern = re.compile('ch[0-9]*.ppt') #the pattern actually creates duplicates in the list

filelist = pattern.findall(string)
print(filelist)

for filename in filelist:
    remotefile = urlopen('http://www.divms.uiowa.edu/~jni/courses/ProgrammignInCobol/presentation/' + filename)
    localfile = open(filename,'wb')
    localfile.write(remotefile.read())
    localfile.close()
    remotefile.close()
martineau
davelupt
  • You should **never** parse HTML with regex; see http://stackoverflow.com/a/1732454/851737. Use an HTML parsing library like lxml or BeautifulSoup. – schlamar Jun 04 '12 at 07:01
  • BeautifulSoup it is. Thank you for your recommendation. – davelupt Jun 04 '12 at 16:56
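
To illustrate the comment above, here is a minimal sketch of extracting the link targets with an HTML parser instead of a regular expression. It uses the standard library's `html.parser` rather than lxml or BeautifulSoup so that it runs without extra installs. The class name `PptLinkParser` and the `sample_html` string are made up for illustration; the real directory listing may be structured differently.

```python
from html.parser import HTMLParser


class PptLinkParser(HTMLParser):
    """Collect the href values of <a> tags that point at .ppt files."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Only the href attribute is inspected, so the link text
            # (which repeats the file name) is never counted.
            if name == "href" and value and value.lower().endswith(".ppt"):
                self.links.append(value)


# Hypothetical listing snippet standing in for the real page.
sample_html = (
    '<a href="ch01.ppt">ch01.ppt</a> '
    '<a href="ch02.ppt">ch02.ppt</a> '
    '<a href="notes.txt">notes.txt</a>'
)

parser = PptLinkParser()
parser.feed(sample_html)
print(parser.links)  # ['ch01.ppt', 'ch02.ppt']
```

Because only `href` attributes are read, each file shows up once, which sidesteps the duplicate matches mentioned in the question.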

1 Answer

This code worked for me. I only changed it slightly, because your pattern matched each .ppt file twice.

from urllib2 import urlopen  # Python 2; on Python 3 use: from urllib.request import urlopen
import re

urlpath = urlopen('http://www.divms.uiowa.edu/~jni/courses/ProgrammignInCobol/presentation/')
string = urlpath.read().decode('utf-8')

pattern = re.compile(r'ch[0-9]*\.ppt"')  # the trailing quote keeps each file name from matching twice

filelist = pattern.findall(string)
print(filelist)

for filename in filelist:
    filename = filename[:-1]  # drop the trailing quote captured by the pattern
    remotefile = urlopen('http://www.divms.uiowa.edu/~jni/courses/ProgrammignInCobol/presentation/' + filename)
    localfile = open(filename,'wb')
    localfile.write(remotefile.read())
    localfile.close()
    remotefile.close()
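
For readers on Python 3, the same idea can be sketched roughly as follows. This is not the code I ran above, just an illustrative variant: it escapes the dot in the pattern, removes duplicates with `dict.fromkeys` instead of relying on a trailing quote, and uses `with` blocks so the files are closed even if a download fails. The helper names `find_ppt_names` and `download_all` and the `sample` string are made up for this example.

```python
import re
from urllib.request import urlopen

BASE_URL = 'http://www.divms.uiowa.edu/~jni/courses/ProgrammignInCobol/presentation/'


def find_ppt_names(html):
    """Return the unique chNN.ppt names found in html, in first-seen order."""
    names = re.findall(r'ch[0-9]*\.ppt', html)
    # dict.fromkeys keeps the first occurrence of each name and drops repeats.
    return list(dict.fromkeys(names))


def download_all(base_url=BASE_URL):
    """Fetch the listing at base_url and save every matching file locally."""
    with urlopen(base_url) as listing:
        html = listing.read().decode('utf-8')
    for name in find_ppt_names(html):
        # Binary mode on both ends so the .ppt bytes are copied unchanged.
        with urlopen(base_url + name) as remote, open(name, 'wb') as local:
            local.write(remote.read())


# Offline check on a hypothetical listing snippet (no network access needed).
sample = '<a href="ch01.ppt">ch01.ppt</a> <a href="ch02.ppt">ch02.ppt</a>'
print(find_ppt_names(sample))  # ['ch01.ppt', 'ch02.ppt']
```

Calling `download_all()` would perform the actual downloads; it is left uncalled here so the snippet can be run without touching the network.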
apple16