
I have an FTP link to a directory that contains some files I am interested in downloading:

ftp://lidar.wustl.edu/Phelps_Rolla/

I can list all of the URLs using the following:

import urllib2
import BeautifulSoup

request = urllib2.Request("ftp://lidar.wustl.edu/Phelps_Rolla/")
response = urllib2.urlopen(request)
soup = BeautifulSoup.BeautifulSoup(response)

>>> soup
drwxrwxrwx   1 user     group           0 Nov  7  2012 .
drwxrwxrwx   1 user     group           0 Nov  7  2012 ..
drwxrwxrwx   1 user     group           0 Nov  7  2012 ESRI_Grids
drwxrwxrwx   1 user     group           0 Nov  7  2012 ESRI_Shapefiles
drwxrwxrwx   1 user     group           0 Nov  7  2012 LAS_Files
-rw-rw-rw-   1 user     group      545700 May 27  2011 LiDAR Accuracy Report_Rolla.pdf
drwxrwxrwx   1 user     group           0 Nov  7  2012 Rolla Survey
-rw-rw-rw-   1 user     group        4865 May 26  2011 Rolla_SEMA_Tile_Index.dbf
-rw-rw-rw-   1 user     group         503 May 26  2011 Rolla_SEMA_Tile_Index.prj
-rw-rw-rw-   1 user     group         188 May 26  2011 Rolla_SEMA_Tile_Index.sbn
-rw-rw-rw-   1 user     group         124 May 26  2011 Rolla_SEMA_Tile_Index.sbx
-rw-rw-rw-   1 user     group        1100 May 26  2011 Rolla_SEMA_Tile_Index.shp
-rw-rw-rw-   1 user     group       12682 May 31  2011 Rolla_SEMA_Tile_Index.shp.xml
-rw-rw-rw-   1 user     group         140 May 26  2011 Rolla_SEMA_Tile_Index.shx

How can I download only the links that contain "Tile" or "tile" with the extensions ".dbf", ".prj", ".shp", and ".shx"?

Borealis

1 Answer


You are using urllib2 and BeautifulSoup, but when dealing with FTP, the specialized standard library module ftplib is probably a better choice. Head to the docs and read how to connect to an FTP server, log in, and list a directory; there are simple walkthroughs there.

The next step is figuring out how to filter your files. This is a matter of a list comprehension that keeps only the strings containing a given substring, e.g. see this question or this question. Finally, you need to find out how to download files via FTP; you will find this question. It turns out that file downloads are made with a call to ftp.retrbinary().

Here is a simple script that does all the things I mentioned above:

from ftplib import FTP

ftp = FTP("lidar.wustl.edu")
ftp.login()
ftp.cwd("Phelps_Rolla")
# list files with ftplib
file_list = ftp.nlst()

# str.endswith accepts a tuple, so one call covers all wanted extensions
extensions = (".dbf", ".prj", ".shp", ".shx")

for f in file_list:
    # apply your filters
    if "tile" in f.lower() and f.endswith(extensions):
        # send the "RETR <name of file>" command; retrbinary calls out.write
        # for each block of binary data received, and "with" closes the file
        with open(f, "wb") as out:
            ftp.retrbinary("RETR {}".format(f), out.write)
        print("downloaded {}".format(f))
ftp.quit()
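
If you want to check the filter without connecting to the server, one option is to pull it out into a small function and run it against the file names from the listing in the question. This is only a sketch: the function name `wanted` and its parameters are my own illustration, not part of ftplib, and it lowercases the name before checking the extension, which is slightly more permissive than the script above.

```python
def wanted(name, keyword="tile", extensions=(".dbf", ".prj", ".shp", ".shx")):
    """Return True if name contains keyword (any case) and ends with a wanted extension."""
    lowered = name.lower()
    # str.endswith accepts a tuple of suffixes
    return keyword in lowered and lowered.endswith(extensions)


# File names copied from the directory listing in the question
names = [
    "LiDAR Accuracy Report_Rolla.pdf",
    "Rolla_SEMA_Tile_Index.dbf",
    "Rolla_SEMA_Tile_Index.prj",
    "Rolla_SEMA_Tile_Index.sbn",
    "Rolla_SEMA_Tile_Index.sbx",
    "Rolla_SEMA_Tile_Index.shp",
    "Rolla_SEMA_Tile_Index.shp.xml",
    "Rolla_SEMA_Tile_Index.shx",
]

selected = [n for n in names if wanted(n)]
print(selected)
```

Note that "Rolla_SEMA_Tile_Index.shp.xml" is skipped because it ends with ".xml", not ".shp". In the download loop you could then write `if wanted(f):` in place of the inline condition.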
Pawel Miech