
The sample site I am using is: http://stats.jenkins.io/jenkins-stats/svg/svgs.html

There are a ton of CSVs linked on this site. Obviously I could go through each link, click it, and download the file, but I know there is a better way.

I was able to put together the following Python script using BeautifulSoup but all it does is print the soup:

from bs4 import BeautifulSoup
import urllib2
jenkins = "http://stats.jenkins.io/jenkins-stats/svg/svgs.html"
page = urllib2.urlopen(jenkins)
soup = BeautifulSoup(page)
print soup

Below is a sample of what I get when I print the soup, but I am still missing how to actually download the multiple CSV files from this output.

<td>
  <a alt="201412-jobs.svg" class="info" data-content="&lt;object data='201412-jobs.svg' width='200' type='image/svg+xml'/&gt;" data-original-title="201412-jobs.svg" href="201412-jobs.svg" rel="popover">SVG</a>
  <span>/</span>
  <a alt="201412-jobs.csv" class="info" href="201412-jobs.csv">CSV</a>
</td>
hansolo

2 Answers


Just use BeautifulSoup to parse the webpage and get all the URLs of the CSV files, then download each one using urllib.request.urlretrieve(). This is a one-time task, so I don't think you need anything like Scrapy for it.
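
A minimal sketch of that approach (untested; it assumes Python 3, hence urllib.request rather than the question's urllib2, and that the .csv links are relative to the page URL):

from urllib.parse import urljoin
from urllib.request import urlopen, urlretrieve
from bs4 import BeautifulSoup

base = "http://stats.jenkins.io/jenkins-stats/svg/svgs.html"
soup = BeautifulSoup(urlopen(base), "html.parser")

for link in soup.find_all("a", href=True):
    href = link["href"]
    if href.endswith(".csv"):
        # resolve the relative href against the page URL and save the file locally
        urlretrieve(urljoin(base, href), href.split("/")[-1])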

vchslv13
  • Thanks for taking the time to answer my question. I made edits to my initial question as it was flagged; hopefully it illustrates where I am more accurately. I understand parsing the webpage using BeautifulSoup, but I don't know how to get the URLs of the CSVs and then execute a command to download them. – hansolo Jan 13 '17 at 19:05
  • Well, you don't have to just parse the page. You should extract the links to the CSVs and pass those links as arguments to the urlretrieve() function. For more info, read the BeautifulSoup and Python manuals. Sorry, but giving more detailed instructions would just mean writing the whole script for you. – vchslv13 Jan 13 '17 at 19:16

I totally get where you're coming from; I've wanted to do the same myself. Luckily, if you are a Linux user there is a super easy way to do what you want. On the web-scraper side: I'm familiar with bs4, but Scrapy is my life (sadly), and as far as I recall bs4 has no real built-in option for downloading files without the use of urllib/requests, but all the same!!

As to your current bs4 spider: first you should probably narrow things down to only the links that are .csv and extract them cleanly. I IMAGINE it would look like

for link in soup.select('a[href]'):
    href = link.get('href')
    # skip anything that doesn't end in .csv (or whatever extension you want)
    if not any(href.endswith(x) for x in ['.csv', '.fileformatetcetc']):
        continue

This is like doing find_all but limiting the results to... well, only the ones with .csv or your desired extension...

Then you would join those hrefs to the base URL (if they're incomplete; if not, that step isn't needed). Using the csv module you would read out the csv files (from the responses, right!!?) and then write them out to new files. For the lols I'm going to create a Scrapy version.
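
A rough sketch of that join-and-save step, continuing from the loop above (the urljoin/urlopen calls and the output filename are my own assumptions, not anything the answer specifies):

import csv
from urllib.parse import urljoin
from urllib.request import urlopen

base = "http://stats.jenkins.io/jenkins-stats/svg/svgs.html"

def save_csv(href):
    # complete a (possibly relative) href against the page URL
    full_url = urljoin(base, href)
    # fetch the response and run it through the csv module
    with urlopen(full_url) as resp:
        rows = list(csv.reader(resp.read().decode("utf-8").splitlines()))
    # write the rows back out to a local file named after the link
    with open(href.split("/")[-1], "w", newline="") as out:
        csv.writer(out).writerows(rows)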

As for that easy method... why not just use wget?

(asciicast: terminal recording demonstrating the wget approach)
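
Something along these lines should cover the wget route (the exact flags here are my own guess at the recipe, not taken from the recording):

wget -r -l1 -nd -np -A.csv http://stats.jenkins.io/jenkins-stats/svg/svgs.html

-r with -l1 follows only the links on that one page, -nd drops everything into the current directory, and -A.csv keeps just the CSV files.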

Found this, which sums up the whole CSV read/write process: https://stackoverflow.com/a/21501574/3794089

scriptso
  • I knew this was a common use case and I just needed someone to demonstrate the wget technique to understand where I was going wrong. THANK YOU VERY MUCH for providing a helpful, intelligent, succinct answer without attitude. Saved me many clicks trying to download 1,057 CSV files. – hansolo Jan 15 '17 at 03:09
  • @hansolo !! I appreciate your civility... though attitudes on here tend to be more on the neutral side compared to other forums/Q&A-type sites I frequent... I feel like I'm constantly having to keep my guard up... Your demeanor is refreshing in the midst of all the "flamers". Anyways, glad it helped! There is a benefit to having a spider/scraper framework for a task like this, for bulk purposes! Reinventing the wheel is not a waste of time if you're strengthening a skill, right?! Good luck, friend. – scriptso Jan 15 '17 at 06:47