I'm trying to parse text from an attribute: src="/captcha?58428805".
I need the text /captcha?58428805, which is different every time.
How can I parse it?

Example element:

<img style="margin: 0;height:40px;width:115px;" 
     width="115" height="40"
     id="captcha" class="captcha" src="/captcha?58428805" 
     alt="Verification code with letters and numbers "/>
EpicKip
You could read about [`IndexOf`](https://msdn.microsoft.com/en-us/library/system.string.indexof(v=vs.110).aspx) and [`Substring`](https://msdn.microsoft.com/en-us/library/system.string.substring(v=vs.110).aspx). You could also use a regular expression. – Pikoh May 09 '17 at 09:06
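To illustrate the `IndexOf`/`Substring` suggestion in the comment above, here is a minimal sketch. The names `SrcExtractor` and `ExtractSrc` are made up for this example, and it assumes the value is wrapped in double quotes and that the first `src="` in the string is the one you want (it would also pick up something like `data-src="`).

```csharp
using System;

static class SrcExtractor
{
    // Hypothetical helper: returns the value of the first src="..." in html,
    // or null when no complete src="..." attribute is found.
    public static string ExtractSrc(string html)
    {
        const string marker = "src=\"";
        int start = html.IndexOf(marker, StringComparison.OrdinalIgnoreCase);
        if (start < 0) return null;

        start += marker.Length;              // first character of the value
        int end = html.IndexOf('"', start);  // closing quote of the value
        if (end < 0) return null;

        return html.Substring(start, end - start);
    }
}

class Program
{
    static void Main()
    {
        string html = "<img id=\"captcha\" class=\"captcha\" src=\"/captcha?58428805\" alt=\"x\"/>";
        Console.WriteLine(SrcExtractor.ExtractSrc(html)); // prints /captcha?58428805
    }
}
```

This avoids regular expressions entirely, which may be enough when the page always contains a single captcha image.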

2 Answers

There are several ways to do this, as @Pikoh said in the comments; here is a regex version. The pattern may need to change slightly depending on how much your HTML strings vary.

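    using System;
    using System.Text.RegularExpressions;

    class Program
    {
        static void Main(string[] args)
        {
            string input = "your html string";
            // Capture the src value of an <img> tag that starts with a style attribute.
            string strReg = @"<img style=.+?src=""(.+?)""";
            Regex reg = new Regex(strReg,
                RegexOptions.Compiled | RegexOptions.Singleline);
            // Groups[1].Value is an empty string when the pattern does not match.
            string youneed = reg.Match(input).Groups[1].Value;
            Console.WriteLine(youneed);
            Console.ReadLine();
        }
    }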
Lei Yang

Lei Yang's answer works for the example, but it will fail if src=SRC_VALUE comes right after <img, like this: <img src="/captcha?58428805" ...SOME_OTHER ATTR..>

This regex might help:

string toTest = @"<img style=""margin: 0;height:40px;width:115px;"" width=""115"" height=""40"" id=""captcha"" class=""captcha"" src=""/captcha?58428805"" alt="" Verification code with letters and numbers ""/>";
var regex = new Regex(@"<img.{0,}src=""(.+?)""");
Console.WriteLine(regex.Match(toTest).Groups[1].Value);

Explanation for <img.{0,}src="(.+?)" (note that quotes are escaped in the above code):

<img - string should contain <img

.{0,} - matches zero or more occurrences of any character except line terminators after the <img (greedy, so it takes as many as possible)

src=" - matches the src=" part after <img

(.+?)" - captures one or more (+) of any character except line terminators (.), as few as possible (? makes it lazy), up to the closing ".

However, because .{0,} is greedy, this regex returns only the last src value when your toTest string contains multiple <img> tags. So you need to split the string per <img> tag and then apply the regex above:

string toTest = @"<img style=""margin: 0;height:40px;width:115px;"" width=""115"" height=""40"" id=""captcha"" class=""captcha"" src=""/captcha?58428805"" alt="" Verification code with letters and numbers ""/><img style=""margin: 0;height:40px;width:115px;"" width=""115"" height=""40"" id=""captcha"" class=""captcha"" src=""/captssscha?5842sss8805"" alt="" Verification code with letters and numbers ""/>";
var imgArr = Regex.Split(toTest, @"(<img[\s\S]+?\/>)").Where(l => l != string.Empty).ToArray(); // split the HTML string into <img> tags (needs using System.Linq)
var srcRegex = new Regex(@"<img.{0,}src=""(.+?)""",RegexOptions.Compiled | RegexOptions.Singleline);

foreach(string imgTag in imgArr) {
    Console.WriteLine(srcRegex.Match(imgTag).Groups[1].Value);
}
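As an alternative to splitting first, the following sketch collects every src in one pass with `Regex.Matches`. It is not the pattern from this answer: it swaps the greedy `.{0,}` for `[^>]*?`, so a match cannot run past the end of its own tag. The names `ImgSrcFinder` and `FindAll` are made up for illustration, and the pattern assumes double-quoted attribute values that contain no `>` character.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ImgSrcFinder
{
    // [^>]*? keeps the match inside a single tag; \s before src skips
    // attributes such as data-src. Assumes double-quoted values without '>'.
    private static readonly Regex ImgSrc = new Regex(
        @"<img\b[^>]*?\ssrc\s*=\s*""([^""]*)""",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns the src value of every <img> tag in html.
    public static List<string> FindAll(string html)
    {
        var result = new List<string>();
        foreach (Match m in ImgSrc.Matches(html))
        {
            result.Add(m.Groups[1].Value);
        }
        return result;
    }
}

class Program
{
    static void Main()
    {
        string html = @"<img id=""a"" src=""/captcha?58428805"" alt=""x""/><img src=""/captssscha?5842sss8805""/>";
        foreach (string src in ImgSrcFinder.FindAll(html))
        {
            Console.WriteLine(src);
        }
    }
}
```

For HTML you do not control, a dedicated HTML parser is generally more robust than any regex.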
xGeo