The problem is with the result variable. The page contains several links ending with jpg. I want to get each of those links as a separate string: on one iteration result should hold one link ending with jpg, and on the next iteration it should hold the next link ending with jpg.

The input looks like this:

https://something.com/my.jpg/a7gfefg/https://something.com/my2.jpg/sadsadsad64567546/https://something.com/my3.jpg

and each time I want result to contain a single link:

https://something.com/my.jpg

then in the next iteration:

https://something.com/my2.jpg

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.IO;
using System.Linq;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading.Tasks;
using System.Windows.Forms;

namespace Testing
{
    public partial class Form1 : Form
    {
        private List<string> links = new List<string>();
        string htmlCode;

        public Form1()
        {
            InitializeComponent();

            GetLinks();
        }

        private void GetLinks()
        {
            using (WebClient client = new WebClient()) // WebClient class inherits IDisposable
            {
                htmlCode = client.DownloadString("https://test.com/my-site");
            }

            int index1 = 0;

            using (StringReader reader = new StringReader(htmlCode))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    int index = line.IndexOf("https://test.com");
                    if (index != -1)
                    {
                        index1 = line.IndexOf("png", index);
                    }
                    if (index != -1 && index1 != -1)
                    {
                        string result = line.Substring(index, index1);
                    }
                }
            }

        }
        private void Form1_Load(object sender, EventArgs e)
        {

        }
    }
}

2 Answers

You're passing the web page's HTML to the File.ReadAllLines method as if it were a file name. You already have the HTML content in a string variable. Remove that line and rename 'content' to 'htmlCode':

private void GetLinks()
{
    using (WebClient client = new WebClient()) // WebClient class inherits IDisposable
    {
        // Get the html content without saving it to a file
        htmlCode = client.DownloadString("https://my-site");
    }

    // Read the downloaded HTML line by line
    using (StringReader reader = new StringReader(htmlCode))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            int index = line.IndexOf("https://something/something");
            if (index == -1) continue; // no link on this line

            int index1 = line.IndexOf(".jpg", index);
            if (index1 == -1) continue; // the link does not end with .jpg

            // Substring takes a length, not an end index; + 4 keeps ".jpg"
            string result = line.Substring(index, index1 - index + 4);
        }
    }
}
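
To get every link on a line rather than only the first, one option is to keep calling IndexOf from where the previous match ended. This is a minimal sketch, assuming each link starts with a known prefix and ends at the first occurrence of the extension after it; the names IndexOfLinkScanner and FindAll are made up for illustration:

```csharp
using System;
using System.Collections.Generic;

public static class IndexOfLinkScanner
{
    // Returns every substring of 'line' that starts with 'prefix' and ends
    // with the first 'extension' found after that prefix.
    public static List<string> FindAll(string line, string prefix, string extension)
    {
        var results = new List<string>();
        int searchFrom = 0;

        while (searchFrom < line.Length)
        {
            int start = line.IndexOf(prefix, searchFrom, StringComparison.Ordinal);
            if (start == -1) break; // no more links on this line

            int extIndex = line.IndexOf(extension, start + prefix.Length, StringComparison.Ordinal);
            if (extIndex == -1) break; // prefix found, but no extension after it

            int end = extIndex + extension.Length;
            // Substring takes (startIndex, length), so pass end - start
            results.Add(line.Substring(start, end - start));

            searchFrom = end; // continue after the link just found
        }

        return results;
    }
}
```

Inside the ReadLine loop this could be called as IndexOfLinkScanner.FindAll(line, "https://something.com", ".jpg"), and each returned string handled in a foreach.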

A regex to find everything starting with https://test.com/ and ending with .jpg could look like this:

https://test.com/.+\.jpg
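
One caveat: + is greedy, so on a line that contains several links a pattern of this shape matches from the first prefix all the way to the last .jpg. A rough sketch of a variant that returns each link separately is below. It assumes the lazy form .+? is acceptable, escapes the dots in the host name, and uses the something.com host from the question; the names JpgLinkExtractor and Extract are made up for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class JpgLinkExtractor
{
    // .+? is lazy: it stops at the first ".jpg" after each prefix.
    // The verbatim string (@"...") avoids doubling the backslashes.
    private static readonly Regex JpgLink =
        new Regex(@"https://something\.com/.+?\.jpg");

    // Returns every non-overlapping match in 'text', in order of appearance.
    public static List<string> Extract(string text)
    {
        var results = new List<string>();
        foreach (Match match in JpgLink.Matches(text))
        {
            results.Add(match.Value);
        }
        return results;
    }
}
```

Calling JpgLinkExtractor.Extract(htmlCode) and looping over the returned list gives one link per iteration, which is what the question asks for.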

. is a special character in a regex that matches any character except a newline. The + after the dot means 'one or more of the preceding pattern'. The . before the jpg extension has to be escaped with a backslash because otherwise it would be treated as that special character. Note that when the pattern is put into a regular C# string literal, the backslashes themselves have to be escaped:

"https://test.com/.+\\.jpg"
Andrew Williamson
  • That's the problem I faced before. The variable line is of type char, so there is no IndexOf method on it. – Oled Neduda Mar 21 '23 at 19:20
  • The content is one big string. When you iterate through it, you're iterating through each character in the string, instead of each line. Have a look at this solution: https://stackoverflow.com/a/1500257/2363967. I'll update the answer to include this. – Andrew Williamson Mar 21 '23 at 19:23
  • I edited my question again, updating what I'm trying to achieve and what the problem is now. Sorry for the mess. – Oled Neduda Mar 21 '23 at 19:49

A better way to extract image URLs from HTML code is to use a regular expression.

Regular expression for extracting image URLs:

(http(s?):)([/|.|\w|\s|-])*\.(?:jpg|gif|png)

For how to use regular expressions in C#, see: https://learn.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference
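
As a rough illustration, the pattern above could be applied with Regex.Matches as sketched here. The names ImageUrlExtractor and Extract are hypothetical, and RegexOptions.IgnoreCase is an added assumption so that extensions such as .JPG are also accepted:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ImageUrlExtractor
{
    // The pattern from this answer, written as a verbatim string so the
    // backslashes do not need to be doubled.
    private static readonly Regex ImageUrl = new Regex(
        @"(http(s?):)([/|.|\w|\s|-])*\.(?:jpg|gif|png)",
        RegexOptions.IgnoreCase);

    // Returns every image URL matched in 'html', in order of appearance.
    public static List<string> Extract(string html)
    {
        var urls = new List<string>();
        foreach (Match match in ImageUrl.Matches(html))
        {
            urls.Add(match.Value);
        }
        return urls;
    }
}
```

Because the character class does not include quotes or colons, each match ends inside a single attribute value in typical markup such as src="...".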