
The sample site I am using is: http://stats.jenkins.io/jenkins-stats/svg/svgs.html

There are a ton of CSVs linked on this site. Obviously I could go through each link, click it, and download the file, but I know there is a better way.

I was able to put together the following Python script using BeautifulSoup but all it does is print the soup:

from bs4 import BeautifulSoup
import urllib2
jenkins = "http://stats.jenkins.io/jenkins-stats/svg/svgs.html"
page = urllib2.urlopen(jenkins)
soup = BeautifulSoup(page)
print soup

Below is a sample of what I get when I print the soup, but I am still missing how to actually download the multiple CSV files from this output.

<td>
  <a alt="201412-jobs.svg" class="info" data-content="&lt;object data='201412-jobs.svg' width='200' type='image/svg+xml'/&gt;" data-original-title="201412-jobs.svg" href="201412-jobs.svg" rel="popover">SVG</a>
  <span>/</span>
  <a alt="201412-jobs.csv" class="info" href="201412-jobs.csv">CSV</a>
</td>
hansolo

2 Answers


Just use BeautifulSoup to parse the webpage and get all the URLs of the CSV files, then download each one using urllib.request.urlretrieve(). This is a one-time task, so I don't think you need anything like Scrapy for it.
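
A minimal sketch of that approach (untested; it assumes Python 3, hence urllib.request rather than the question's urllib2, and that the .csv links are relative to the page URL):

from urllib.parse import urljoin
from urllib.request import urlopen, urlretrieve
from bs4 import BeautifulSoup

base = "http://stats.jenkins.io/jenkins-stats/svg/svgs.html"
soup = BeautifulSoup(urlopen(base), "html.parser")

for link in soup.find_all("a", href=True):
    href = link["href"]
    if href.endswith(".csv"):
        # resolve the relative href against the page URL and save the file locally
        urlretrieve(urljoin(base, href), href.split("/")[-1])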

vchslv13
  • Thanks for taking the time to answer my question. I made edits to my initial question as it was flagged; hopefully it illustrates where I am more accurately. I understand parsing the webpage using BeautifulSoup, but I don't know how to get the URLs of the CSVs and then execute a command to download them. – hansolo Jan 13 '17 at 19:05
  • Well, you don't have to just parse the page. You should extract the links to the CSVs and pass those links as arguments to the urlretrieve() function. For more info, read the BeautifulSoup and Python manuals. Sorry, but giving more detailed instructions would just mean writing the whole script for you. – vchslv13 Jan 13 '17 at 19:16

I totally get where you're coming from; I've wanted to do the same myself. Luckily, if you are a Linux user there is a super easy way to do what you want. On the web-scraper side: I'm familiar with bs4, but Scrapy is my life (sadly), and as far as I recall bs4 has no real built-in option for downloading files without the use of urllib/requests, but all the same!!

As to your current bs4 spider: first you should probably narrow things down to only the links that are .csv and extract them cleanly. I IMAGINE it would look like

for link in soup.select('a[href]'):
    href = link.get('href')
    # skip anything that doesn't end in .csv (or whatever extension you want)
    if not any(href.endswith(x) for x in ['.csv', '.fileformatetcetc']):
        continue

This is like doing find_all but limiting the results to... well, only the ones with .csv or your desired extension...

Then you would join those hrefs to the base URL (if they're incomplete; if not, that step isn't needed). Using the csv module you would read out the csv files (from the responses, right!!?) and then write them out to new files. For the lols I'm going to create a Scrapy version.
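
A rough sketch of that join-and-save step, continuing from the loop above (the urljoin/urlopen calls and the output filename are my own assumptions, not anything the answer specifies):

import csv
from urllib.parse import urljoin
from urllib.request import urlopen

base = "http://stats.jenkins.io/jenkins-stats/svg/svgs.html"

def save_csv(href):
    # complete a (possibly relative) href against the page URL
    full_url = urljoin(base, href)
    # fetch the response and run it through the csv module
    with urlopen(full_url) as resp:
        rows = list(csv.reader(resp.read().decode("utf-8").splitlines()))
    # write the rows back out to a local file named after the link
    with open(href.split("/")[-1], "w", newline="") as out:
        csv.writer(out).writerows(rows)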

As for that easy method... why not just use wget?

(asciicast: terminal recording demonstrating the wget approach)
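
Something along these lines should cover the wget route (the exact flags here are my own guess at the recipe, not taken from the recording):

wget -r -l1 -nd -np -A.csv http://stats.jenkins.io/jenkins-stats/svg/svgs.html

-r with -l1 follows only the links on that one page, -nd drops everything into the current directory, and -A.csv keeps just the CSV files.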

Found this, which sums up the whole CSV read/write process: https://stackoverflow.com/a/21501574/3794089

scriptso
  • I knew this was a common use case and I just needed someone to demonstrate the wget technique to understand where I was going wrong. THANK YOU VERY MUCH for providing a helpful, intelligent, succinct answer without attitude. Saved me many clicks trying to download 1,057 CSV files. – hansolo Jan 15 '17 at 03:09
  • @hansolo !! I appreciate your civility... though attitudes on here tend to be more on the neutral side compared to other forums/Q&A-type sites I frequent... I feel like I'm constantly having to keep my guard up... Your demeanor is refreshing in the midst of all the "flamers". Anyways, glad it helped! There is a benefit to having a spider/scraper framework for a task like this, for bulk purposes! Reinventing the wheel is not a waste of time if you're strengthening a skill, right?! Good luck, friend. – scriptso Jan 15 '17 at 06:47