
I am currently trying to implement a simple web downloader that recursively downloads files from a single directory on an HTTP server.

What I have so far to list the files on the server (Updater.cs):

    public static List<string> remote_filecheck()
    {
        List<string> rfiles = new List<string>();
        string url = "http://********/patchlist.txt";

        // Download the patch list to a local file and dispose of the client afterwards.
        using (WebClient client = new WebClient())
        {
            client.DownloadFile(url, @"patchlist.txt");
        }

        // Read the patch list line by line; each line is a direct link to a file.
        using (StreamReader reader = new StreamReader("patchlist.txt"))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                rfiles.Add(line);
            }
        }

        return rfiles;
    }

I currently work with a patch list that consists of direct links to all of my HTTP files.
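
For reference, the patch list is just a plain text file with one direct URL per line; the entries below are only illustrative examples (host redacted as in the original):

    http://********/patchtest/Project%20eXistence.exe
    http://********/patchtest/Content/textures.xnb
    http://********/patchtest/somescript.cs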

I have tried nearly every snippet I could find on the web concerning recursive download, e.g. regex and WebRequest approaches.

Now I want to know whether you have a good way to walk recursively through my HTTP server and list all the filenames, which is all I need.

Once I have a List<string> of filenames, I can do the rest myself.

  • You might find this useful for finding the directory listing: http://stackoverflow.com/questions/124492/c-httpwebrequest-command-to-get-directory-listing – Reddog Oct 12 '11 at 22:01
  • Well, as I already said, I tried the regex stuff, but I got stuck on the right regex and my output was very strange. And yes, my server has directory listing enabled. – Nop0x Oct 12 '11 at 22:05
  • They have a regex that supposedly works listed in there. Or perhaps have a look at the solution that uses the HTML Agility Pack in there... – Reddog Oct 12 '11 at 22:16
  • Well, the HTML Agility Pack works with links in .htm files, which isn't what I want :D I think I will go back to the regex approach and try to get what I want. – Nop0x Oct 12 '11 at 22:28
  • As per that link, it explains that the only directory listing output you can get at remotely is via the HTML generated by the webserver (and unfortunately that's non-standard). If you have control of the web server you could of course easily write a webservice or http handler to return a more standardised response (for example, in XML). – Reddog Oct 12 '11 at 22:32
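
Building on the regex route discussed in the comments above, a minimal recursive-listing sketch might look like this. It is untested and assumes an Apache-style auto-index page where every entry appears as an href attribute and where baseUrl ends with a trailing slash; the RecurseListing name, the regex and the filters are assumptions, not code from the thread:

    // Requires System.Net and System.Text.RegularExpressions.
    // Minimal sketch, assuming an Apache-style auto-index and that baseUrl ends with "/".
    public static void RecurseListing(string baseUrl, List<string> files)
    {
        string html;
        using (WebClient client = new WebClient())
        {
            html = client.DownloadString(baseUrl);
        }

        // Pull every href out of the listing page.
        foreach (Match m in Regex.Matches(html, "href=\"(?<href>[^\"]+)\""))
        {
            string href = m.Groups["href"].Value;

            // Skip the "?C=N;O=D" sort links and the parent/root links.
            if (href.StartsWith("?") || href.StartsWith("/") || href.StartsWith(".."))
                continue;

            if (href.EndsWith("/"))
                RecurseListing(baseUrl + href, files); // subdirectory: descend into it
            else
                files.Add(baseUrl + href);             // file: remember its full URL
        }
    }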

1 Answer


Does the server that you are trying to get the files from have directory indexing switched on?

If so, then it's probably a matter of scraping the index page that comes back and visiting each URL one by one.

If not, then I'm not sure it can be done very easily.

OK, based on the comments below, I think you'll want to do something like this:

        // Address of the index page to scrape (stackoverflow is just a placeholder).
        string indexUrl = "http://www.stackoverflow.com";

        // Let the WebBrowser control load and parse the HTML for us.
        WebBrowser browser = new WebBrowser();
        browser.Navigate(indexUrl);

        // Pump the message loop until the page has finished loading.
        do
        {
            Application.DoEvents();
        } while (browser.ReadyState != WebBrowserReadyState.Complete);

        var listOfFilePaths = new List<string>();

        // Collect the href of every anchor tag on the index page.
        foreach (HtmlElement linkElement in browser.Document.GetElementsByTagName("a"))
        {
            var pagePath = linkElement.GetAttribute("href");
            listOfFilePaths.Add(pagePath);
        }

Note that the WebBrowser control needs to be run in a Windows Forms app to get it to work (easily). The indexUrl variable should be changed to the address of the index page of your server (I just used stackoverflow as an example).

The foreach loop extracts every anchor (a) tag from the page, reads the path it points to and adds it to the listOfFilePaths collection.

Once this code has finished executing, the listOfFilePaths collection will contain an entry for every link on the index page, and hence a link to every file on the server.

From here it's a matter of looping over the listOfFilePaths collection and downloading each file one by one, perhaps applying some rules to skip file types you're not interested in. From what you've said, I believe you should be able to do this.
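
To make that last step concrete, a rough sketch of the download loop might look like the following; it assumes the listOfFilePaths collection built above and that the hrefs are absolute URLs, and it skips the auto-index sort links and directory entries (the filters are illustrative, not tested):

        // Requires System.IO and System.Net.
        // Rough sketch: download everything that looks like a file.
        using (WebClient client = new WebClient())
        {
            foreach (string path in listOfFilePaths)
            {
                // Skip the "?C=N;O=D"-style sort links and directory entries.
                if (path.Contains("?") || path.EndsWith("/"))
                    continue;

                // Save the file locally under its original name.
                string fileName = Path.GetFileName(new Uri(path).LocalPath);
                client.DownloadFile(path, fileName);
            }
        }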

Hope this helps.

Kevin Holditch
  • Well, the server has indexing switched on, but if I try to do it the regular-expression way I get some strange output in my list. – Nop0x Oct 12 '11 at 21:05
  • Ok can you give a small snippet of the html page that you get back from the server? – Kevin Holditch Oct 12 '11 at 21:09
  • Well, I don't want to get an HTML page, I want to get some files from the server (some binaries, .xnb and C# files). In fact, I want to build some kind of updater which is able to compare the local files to the ones on the web server and download new or missing files. – Nop0x Oct 12 '11 at 21:13
  • Yes, I understand that, but from your question above I think what you want to do is visit the index page, work out from there what files are on the server and then download them. What I'm suggesting is grabbing the HTML markup from the index page and then writing a parser to read the URLs and visit them one by one to download the content. – Kevin Holditch Oct 12 '11 at 21:18
  • What I want to do is work through the files, save the filenames and download them afterwards. The downloading is something I'm able to do on my own; what I'm not able to do is crawl through the server and get the filenames. Well, have you got a little snippet to point me in the right direction for the parser? – Nop0x Oct 12 '11 at 21:20
  • Exactly, which is why I'm asking for a snippet of the html from the index page as that's where the file names are stored. From there I can help you come up with an algorithm to download all of the files. – Kevin Holditch Oct 12 '11 at 21:22
  • Thanks for your help! :) heres the HTML code: `[DIR]Parent Directory  - [DIR]Content/12-Oct-2011 18:02 - [   ]Project eXistence.exe12-Oct-2011 18:00 30K` – Nop0x Oct 12 '11 at 21:25
  • Well your snippet helped me,but now i get this output: `http://*****/patchtest/?C=N;O=D http://*****/patchtest/?C=M;O=A http://*****/patchtest/?C=S;O=A http://*****3/patchtest/?C=D;O=A http://*****/ http://*****/patchtest/Content/ http://*****/patchtest/Project%20eXistence.exe http://*****/patchtest/patchlist.txt` – Nop0x Oct 12 '11 at 22:01
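
Judging by that last output, the extra entries are Apache's column-sort links and the parent directory. One way to drop them might be a simple filter along these lines (illustrative only; requires System.Linq, and the baseUrl value stands in for whatever your patch directory actually is):

    string baseUrl = "http://*****/patchtest/";
    var fileUrls = listOfFilePaths
        .Where(p => p.StartsWith(baseUrl)) // drops the parent/root link
        .Where(p => !p.Contains("?"))      // drops the ?C=N;O=D sort links
        .Where(p => !p.EndsWith("/"))      // drops subdirectories such as Content/
        .ToList();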