1

I'm writing a program that uses an API which crawls the web to fetch pieces of information. I am then showing the results on an HTML Document (Excel was an option, but due to some restrictions and features I opted for an Html Document)

Long story short, the request result often have links to an image and I need to show them to the user. The problem is that sometimes some images don't exist anymore.

The basic is as follow:

string carouselItems = "";

for (int i = 0; i < imgs.Count; i++)
{

    if (i == 0)
        carouselItems += $"" +
            $"<div class=\"carousel-item active\">" +
            $"  <img src = \"" + imgs[i] + "\" class=\"d-block w-100\" alt=\"Bild "
                + (i + 1).ToString() + " von " + imgs.Count.ToString()
                + "\" onerror=\"this.style.display = 'none'\">" +
            $"  <div class=\"carousel-caption d-none d-md-block\">" +
            $"      <p>Bild " + (i + 1).ToString() + "</p>" +
            $"  </div>" +
            $"</div>";
    else
        carouselItems += $"" +
            $"<div class=\"carousel-item\">" +
            $"  <img src = \"" + imgs[i] + "\" class=\"d-block w-100\" alt=\"Bild "
                + (i + 1).ToString() + " von " + imgs.Count.ToString()
                + "\" onerror=\"this.style.display = 'none'\">" +
            $"  <div class=\"carousel-caption d-none d-md-block\">" +
            $"      <p>Bild " + (i + 1).ToString() + "</p>" +
            $"  </div>" +
            $"</div>";
}

I thought to make an HttpRequest to the link and to hold on only to those images that returned a content type of "image", because sometime some links redirect me to a website (probably because the image doesn't exist anymore), like this:

for (int i = 0; i < line.Bild.Count; i++)
{
    var client = new RestClient(line.Bild[i]);
    var request = new RestRequest(Method.GET);
    client.FollowRedirects = true;
    request.AddHeader("Accept", "image/*");
    var res = client.Execute(request);
    if (res.ContentType.Contains("image/")) images.Add(line.Bild[i]);
}

This approach is fine, but "slow" whenever I have hundreds of images to check. What I would like to achieve is some form of parallelization for which all images are checked at the same time and I am returned with a list of unbroken images. I am not so good with asynchronous coding that's why I need some help.

I will be happy to read your answers! BR

Theodor Zoulias
  • 34,835
  • 7
  • 69
  • 104
EDZ III
  • 33
  • 5
  • 2
    Checking hundreds of images is always going to be slow, you're limited by bandwidth for a start. – DavidG Nov 19 '20 at 12:02
  • In a scenario where I have unlimited bandwidth, what could be done to help me with my problem? – EDZ III Nov 19 '20 at 12:07
  • So, what I am reading out of this is: 1. You have a list of img-links (or at least supposedly images). 2. In a first step, you want to figure out if they are broken or not. 3. Process only not-broken images (maybe return a default "not available" otherwise?) Is that correct? Ideally you want parallization to speed up the process, right? – Fildor Nov 19 '20 at 12:16
  • 1
    ^^ I'd first figure out how to efficiently check "brokenness" for 1 image. Then I'd make that an ActionBlock (or whatever is appropriate). Then I'd set up a [DataFlow](https://learn.microsoft.com/en-us/dotnet/standard/parallel-programming/dataflow-task-parallel-library) Pipeline with desired parallelism configured and post the links to it. – Fildor Nov 19 '20 at 12:21
  • 1
    @EDZIII Can you use `HEAD` instead of `GET`? Or if the target server does not allow `HEAD` you can give it a try to issue requests with `HttpCompletionOption.ResponseHeadersRead` – Peter Csala Nov 19 '20 at 12:26
  • @Fildor 1)I need to pick out the broken image links and show the user only the available ones 2) sounds a little too complicated to my skill level and have to check if it is compliant with my companys policy – EDZ III Nov 19 '20 at 13:51
  • 1
    It sounds way more intimidating than it is, actually. – Fildor Nov 19 '20 at 16:16
  • As a side note, using [string concatenation](https://stackoverflow.com/questions/21078/most-efficient-way-to-concatenate-strings) to form the HTML result is cringy. And duplicating a chunky HTML fragment just to add the `active` CSS class is also shivering. I understand that your coding skills are under development right now. I am just pointing out to problems that are particularly serious, and you may want to prioritize fixing them. The second issue can be addressed with the ternary operator: `+ (i == 0 ? " active" : "") +` – Theodor Zoulias Nov 19 '20 at 17:42

2 Answers2

2

You should start by implementing an asynchronous method that checks the existence of individual image URLs. This method should have a return value of Task<bool> instead of bool, since it is asynchronous. Below is a basic implementation, that you may want to customize to your needs:

private HttpClient _httpClient = new HttpClient();

private async Task<bool> CheckImageAsync(string imageUrl)
{
    try
    {
        byte[] imageBytes = await _httpClient.GetByteArrayAsync(imageUrl);
        using (var ms = new MemoryStream(imageBytes))
        {
            var image = Image.FromStream(ms);
        }
        return true;
    }
    catch (Exception)
    {
        return false;
    }
}

Then inside your main method that does the fetch/parse/extract work, you can use the above method to create one task for each unverified image URL in the raw HTML. All these tasks will start running immediately and will be running concurrently. You must wait all of them to complete before continuing. You can wait them either synchronously or asynchronously. Below is an example with synchronous waiting, that you may be more familiar with. It uses LINQ and value tuples to combine each URL with a bool result.

var unverifiedImageUrls = new List<string>();
// Here put the code that extracts the images from the raw HTML,
// and puts them into the list.

Task<(string Url, bool Verified)>[] verifyTasks = unverifiedImageUrls
    .Select(async url => (Url: url, Verified: await CheckImageAsync(url)))
    .ToArray();

string[] verifiedImageUrls = Task.WhenAll(verifyTasks).Result
    .Where(result => result.Verified)
    .Select(result => result.Url)
    .ToArray();

// Here use the resulting array to generate your HTML result

If you prefer to make everything asynchronous, you should remove the .Result and add an await before the Task.WhenAll. But this change will require changes to other parts of your program as well.

Theodor Zoulias
  • 34,835
  • 7
  • 69
  • 104
1

You could save a lot of time just by switching GET to HEAD, which tells to receive metadata only, not the actual file.

using System;
using System.Net.Http;
using System.Threading.Tasks;

namespace ConsoleApp3
{
    class Program
    {
        static HttpClient httpClient = new HttpClient();

        static async Task<bool> IsImageExists(string url)
        {
            var result = await httpClient.SendAsync(new HttpRequestMessage()
            {
                RequestUri = new Uri(url),
                Method = HttpMethod.Head,
            });

            return result.IsSuccessStatusCode;
        }

        static async Task Main(string[] args)
        {
            bool exists = await IsImageExists("https://upload.wikimedia.org/wikipedia/en/9/95/Test_image.jpg");

            if (!exists)
            {
                // do something else to confirm that image does not exist, e.g. your regular GET request
            }
        }
    }
}
Yehor Androsov
  • 4,885
  • 2
  • 23
  • 40