2

I have a page that contains links to .mp3/.wav files in this format:

<a href="http://siteName/subfolder/filename.mp3">File Name</a>

I need to write a script that will download all of these files, instead of downloading each one myself.

I know that I can use a regular expression to do something like that, but I don't know how. Also, what is the best choice for this (Java, C#, JavaScript)?

Any help will be appreciated

Thanks in Advance

Amira Elsayed Ismail

3 Answers

1

You could use SgmlReader to parse the DOM and extract all the anchor links and then download the corresponding resources:

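using System;
using System.Net;
using System.Xml;
using Sgml; // namespace of the SgmlReader library

class Program
{
    static void Main()
    {
        // URL of the page that lists the audio files
        var baseUri = new Uri("http://www.example.com");
        using (var reader = new SgmlReader())
        {
            reader.DocType = "HTML";
            reader.Href = baseUri.AbsoluteUri;
            var doc = new XmlDocument();
            doc.Load(reader);
            // Select the href attributes that mention mp3 or wav
            var anchors = doc.SelectNodes("//a/@href[contains(., 'mp3') or contains(., 'wav')]");
            foreach (XmlAttribute href in anchors)
            {
                using (var client = new WebClient())
                {
                    // Resolve relative links against the page URL before downloading
                    var data = client.DownloadData(new Uri(baseUri, href.Value));
                    // TODO: do something with the downloaded data
                }
            }
        }
    }
}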
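The TODO above leaves open where each download should be saved. As a rough sketch of that step, written in Python 3 rather than C#, the helper below resolves an href against the page URL and derives a local file name from it. The name `resolve_download` is hypothetical and not part of any answer here, and the approach (taking the last path segment) is only one reasonable assumption.

```python
import posixpath
from urllib.parse import unquote, urljoin, urlsplit


def resolve_download(page_url, href):
    """Return (absolute_url, local_file_name) for one href found on page_url.

    Hypothetical helper for illustration only.
    """
    # Relative hrefs are resolved against the page URL; absolute ones pass through.
    absolute_url = urljoin(page_url, href)
    # Use only the path component so query strings and fragments are ignored.
    path = urlsplit(absolute_url).path
    # Decode percent-escapes such as %20 to get a readable file name.
    file_name = unquote(posixpath.basename(path))
    return absolute_url, file_name
```

The same idea could be expressed in C# with `Uri` and `Path.GetFileName`; the point is only that the file name comes from the URL path, not from the link text.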
Darin Dimitrov
1

Well, if you want to go hard-core, I think parsing the page with DOMDocument ( http://php.net/manual/en/class.domdocument.php ) and retrieving the files with cURL would do it, if you're OK with PHP.

How many files are we talking about here?

Claudiu
  • Thanks for your reply: about 200 files or more – Amira Elsayed Ismail Oct 09 '10 at 15:59
  • Oh well, it might not be the ideal task for PHP, but if you're more into experimenting you can go with it. Otherwise, go with something like what @Darin suggested, although it's more or less the same approach, just in a different language :) – Claudiu Oct 09 '10 at 16:10
1

Python's Beautiful Soup library is well-suited to this task: http://www.crummy.com/software/BeautifulSoup/

Could be used in this way:

import urllib2, re
from BeautifulSoup import BeautifulSoup

# open the URL
page = urllib2.urlopen("http://www.foo.com")
# parse the page
soup = BeautifulSoup(page)
# get all anchor elements that actually have an href attribute
# (a["href"] raises KeyError for anchors without one)
anchors = soup.findAll("a", href=True)
# keep only anchors whose href contains .wav or .mp3
audioPattern = re.compile(r"\.(wav|mp3)")
filteredAnchors = filter(lambda a: audioPattern.search(a["href"]), anchors)
urlsToDownload = map(lambda a: a["href"], filteredAnchors)
# download each anchor url...

See here for instructions on downloading the mp3's from their URLs: How do I download a file over HTTP using Python?
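For the download step itself, here is a minimal sketch that uses only the Python 3 standard library (the snippet above targets Python 2, so this is not a drop-in continuation of it). The function names `local_name` and `download_all`, the `opener` parameter, and the `download.bin` fallback name are all assumptions made for illustration.

```python
import os
import shutil
import urllib.request
from urllib.parse import urljoin, urlsplit


def local_name(url):
    """Derive a local file name from the last path segment of the URL."""
    # Fall back to a placeholder name when the URL path ends in a slash.
    return os.path.basename(urlsplit(url).path) or "download.bin"


def download_all(page_url, hrefs, target_dir, opener=urllib.request.urlopen):
    """Download every href (resolved against page_url) into target_dir.

    `opener` is injectable so the function can be tried without network access.
    Returns the list of file paths that were written.
    """
    os.makedirs(target_dir, exist_ok=True)
    saved = []
    for href in hrefs:
        # Relative links are resolved against the page they were found on.
        url = urljoin(page_url, href)
        destination = os.path.join(target_dir, local_name(url))
        # Stream the response to disk instead of reading it all into memory.
        with opener(url) as response, open(destination, "wb") as out_file:
            shutil.copyfileobj(response, out_file)
        saved.append(destination)
    return saved
```

With the list of hrefs collected from the page, a call might look like `download_all("http://www.foo.com", urlsToDownload, "downloads")`. For around 200 files a plain sequential loop like this is probably sufficient, though it does no retrying or error handling.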

Community
jbeard4
  • Thanks, Mr./Ms. echo-flow, for your answer. I have never used Python before and I don't know anything about it, but I'd love to know: what is the Python language, and what are its advantages/disadvantages compared with C#, Java or C++? Please answer if you have time. Thanks in advance – Amira Elsayed Ismail Oct 09 '10 at 17:38
  • Off the top of my head, I can say: Python is a general-purpose, dynamically typed, object-oriented scripting language. It is often referred to as "executable pseudocode" because its programs are extremely readable. It's used by NASA, as well as by Google for services like Gmail. It's open source, and developed and maintained by the community. Its advantage over languages such as C#, Java or C++ is that it's at a higher level, and is extremely flexible both in terms of its syntax and its semantics. A disadvantage is that it can be slower than any of these languages. See python.org for more info. – jbeard4 Oct 09 '10 at 21:27