
So, I have code that downloads pictures from parsed links. The downloading and parsing work well, but I have a problem getting the full content of the page to load.

/*
 * https://shikimori.org/animes/38256-magia-record-mahou-shoujo-madoka-magica-gaiden-tv
 * For testing
 */

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;
using System.Threading;
using HtmlAgilityPack;

class Program
{
    static string root = @"C:\Shikimori\";
    static List<string> sources = new List<string>();

    [STAThread]
    static void Main(string[] args)
    {
        Console.Write("Enter link: ");
        var link = Console.ReadLine(); // paste the test link from the comment above
        link += "/art";

        var web = new HtmlWeb();
        web.BrowserTimeout = TimeSpan.FromTicks(0); // intended to disable the browser timeout

        var htmlDocument = new HtmlDocument();

        Thread.Sleep(3000);

        try
        {
            htmlDocument = web.LoadFromBrowser(link); // only about once every ~30 minutes does this load the page with (almost) all pictures
        }
        catch (Exception ex)
        {
            Console.WriteLine($"An error has occurred: {ex.Message}");
        }

        Thread.Sleep(3000);

        var name = htmlDocument.DocumentNode.Descendants("div")
            .Where(node => node.GetAttributeValue("class", "")
            .Equals("b-options-floated mobile-phone_portrait r-edit")).ToList();

        //var divlink = htmlDocument.DocumentNode.Descendants("div")
        //    .Where(node => node.GetAttributeValue("class", "")
        //    .Equals("container packery")).ToList();

        var alink = htmlDocument.DocumentNode.Descendants("a")
            .Where(node => node.GetAttributeValue("class", "")
            .Equals("b-image")).ToList();

        foreach(var a in alink)
        {
            sources.Add(a.GetAttributeValue("href", string.Empty));
        }

        var tmp = Regex.Replace(name[0].GetDirectInnerText(), "[^a-zA-Z0-9._]", string.Empty);

        root += (tmp+"\\");

        if (!Directory.Exists(root))
        {
            Directory.CreateDirectory(root);
        }

        for (int i = 0; i < sources.Count; i++)
        {
            using (WebClient client = new WebClient())
            {
                var test = sources[i].Split(';').Last().Replace("url=", string.Empty);
                try
                {
                    client.DownloadFile(new Uri(test), root + test.Split('/').Last().Replace("&amp", string.Empty).Replace("?", string.Empty));
                    Console.WriteLine($"Image #{i + 1} downloaded successfully!");
                }
                catch
                {
                    Console.WriteLine($"Image #{i + 1} failed to download...");
                }
            }
        }

        Thread.Sleep(3000);

        Console.WriteLine("Done!");
        Console.ReadKey();

    }
}
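
For reference, the href cleanup buried in the download loop can be pulled into a small helper. This is a sketch based only on the string operations the loop already performs (split on `;`, strip a `url=` prefix, drop `&amp` and `?` from the file name); the exact href format used by the site is an assumption on my part:

```csharp
using System;
using System.Linq;

static class UrlCleanup
{
    // Extracts the real image URL from an href assumed to look like
    // "<redirector>;url=<actual-url>", mirroring the loop above.
    public static string ExtractImageUrl(string href)
    {
        return href.Split(';').Last().Replace("url=", string.Empty);
    }

    // Builds a local file name from the URL, stripping the same
    // characters the download loop strips.
    public static string LocalFileName(string url)
    {
        return url.Split('/').Last()
                  .Replace("&amp", string.Empty)
                  .Replace("?", string.Empty);
    }
}
```

With helpers like this the download loop reduces to `client.DownloadFile(new Uri(ExtractImageUrl(sources[i])), root + LocalFileName(...))`, which is easier to test in isolation.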

The issue is: it only works properly about once every 30 minutes, I guess? And it does not work as well as I expected: the HTML parser does not load the content fully. If a link has 100+ pictures, on a good run I get around 5 to 15 of them. If a link has around 30 pictures (for example: https://shikimori.one/animes/1577-taiho-shichau-zo), it can usually parse them all. (I haven't tested other cases. I also tried to parse Google Images; it only loads the first page of results and never reaches the "More results" button.)
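
I can't be sure this is the cause, but these symptoms (partial content, more images on "good" runs) are typical of a page that loads its images lazily via JavaScript. HtmlAgilityPack's `LoadFromBrowser` has an overload that takes a completion predicate: it is called repeatedly with the current HTML, and returning `false` tells it to keep waiting. A sketch of that idea, where the "stop when the HTML stops growing" condition is my own heuristic, not anything the library prescribes (this still needs `[STAThread]`, like the code above):

```csharp
using System;
using HtmlAgilityPack;

class LazyPageLoader
{
    // Loads a page via the embedded browser, waiting until the HTML
    // stops growing between polls (a heuristic for "lazy content done").
    public static HtmlDocument LoadWhenStable(string url)
    {
        var web = new HtmlWeb { BrowserTimeout = TimeSpan.FromMinutes(2) };
        int lastLength = -1;

        return web.LoadFromBrowser(url, html =>
        {
            // Returning false makes HtmlAgilityPack poll again.
            if (html.Length == lastLength)
                return true;       // size stable -> assume loading finished
            lastLength = html.Length;
            return false;          // still growing -> keep waiting
        });
    }
}
```

A stricter variant would count the `a.b-image` nodes in `html` and only return `true` once the count stops increasing.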

I assume the site has some kind of bot protection, and therefore it does not always respond to requests from my program, or something like that. As far as I understand, this guy has the same problem, but there is still no answer. How can this be fixed?
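
One thing worth ruling out (it also comes up in the comments below) is the request headers: both `HtmlWeb` and `WebClient` send a default User-Agent that bot protection may flag. A minimal sketch using HtmlAgilityPack's `HtmlWeb.UserAgent` property and `WebClient` headers; note that I'm not certain `LoadFromBrowser` honors `UserAgent` at all, since it drives the embedded browser rather than a plain HTTP request, which could explain why setting it appears to change nothing:

```csharp
using System.Net;
using HtmlAgilityPack;

static class BrowserLikeRequests
{
    const string ChromeUA =
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " +
        "(KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36";

    // Applies to HtmlWeb.Load (plain HTTP), not necessarily LoadFromBrowser.
    public static HtmlDocument LoadWithUserAgent(string url)
    {
        var web = new HtmlWeb { UserAgent = ChromeUA };
        return web.Load(url);
    }

    // Sets the User-Agent header on the image downloads as well.
    public static void DownloadWithUserAgent(string url, string path)
    {
        using (var client = new WebClient())
        {
            client.Headers[HttpRequestHeader.UserAgent] = ChromeUA;
            client.DownloadFile(url, path);
        }
    }
}
```

If the site's protection checks more than the User-Agent (cookies, rate limits), this alone won't fix it, but it is cheap to try.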

  • Thread.Sleep(3000) means the execution of this thread will be delayed for 3 seconds. (roughly speaking..) It is not 30 minutes. This source code first downloads the html content of the web page and then finds – Oguz Ozgul Apr 11 '20 at 20:16
  • @oguz-ozgul, let me explain a little more. There has been no issue with it in all the time I've been compiling this code. But, as I said, it loads the page [fully](https://drive.google.com/open?id=1mnMdCKSsgn4PGoyzQXUoJvQHafwxJup4) about once per ~30 minutes, and when I run it again I get a [not fully](https://drive.google.com/open?id=1G6EyTApsDJKqzEgqxqows2LgulxCxiUv) loaded page. You can check for yourself what htmlDocument contains. If I wait ~30 minutes, it again works like the first run. It always loads some HTML source code, but not always all of it (with a ~30 minute delay between program runs). – Okashi Apr 11 '20 at 20:36
  • Then as you say, the web site has some kind of restriction on their end. The best suggestion I can make is to set a good user-agent string to your web client (like your user-agent string when requesting this site from Chrome browser) to pass any restrictions which checks for the validity of the user-agent header. – Oguz Ozgul Apr 11 '20 at 20:39
  • I am not very good at this. How can I set user-agent in my code? – Okashi Apr 11 '20 at 20:41
  • https://stackoverflow.com/questions/11841540/setting-the-user-agent-header-for-a-webclient-request The answer here explains how. You can use something like this. Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36 – Oguz Ozgul Apr 11 '20 at 20:51
  • @oguz-ozgul, if it should look like `web.UserAgent = "Mozilla/5.0 (Linux; Android 6.0.1; SM-G532F) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.105 Mobile Safari/537.36"`, it doesn't work; everything is the same. Found this user-agent [here](https://olegon.ru/user_agents.txt). – Okashi Apr 11 '20 at 20:52
  • Sorry to hear that. Can you please catch exception and print out full details as I mentioned earlier – Oguz Ozgul Apr 11 '20 at 21:18
  • @oguz-ozgul, there are no exceptions. It just sometimes loads fully, sometimes not. You can also remove the try/catch blocks; nothing changes, they're just there to be safe. Maybe I didn't get exactly what you want; [here](https://drive.google.com/open?id=1-VDdjX65FaIihwnjnfRZpZDy8gKC3Zz4) is the solution of my program. If you can check it yourself and help with my problem, I will be very grateful. – Okashi Apr 11 '20 at 21:34
  • No promises, but I will try – Oguz Ozgul Apr 11 '20 at 21:36
  • @oguz-ozgul, Thanks! Let me know if something works out. – Okashi Apr 11 '20 at 21:39
  • I'm sorry, but I messed up the projects a bit. [Here](https://drive.google.com/open?id=1aU_vKENszOVrtFe8MH3tJfGZQHNjajfe) is the last one. – Okashi Apr 11 '20 at 23:16
  • I tried it, and downloaded https://shikimori.org/animes/38256-magia-record-mahou-shoujo-madoka-magica-gaiden-tv but it found no `` there. Count = 0. Sorry. – Oguz Ozgul Apr 12 '20 at 01:55

0 Answers