I used WebClient in C# to download the HTML of a YouTube video page. Now I'm trying to extract a YouTube comment from that HTML, but it's not working because different comments that use the same element (yt-formatted-string) have different attributes (class, id, span, and so on). So I'm trying to get the regex to skip over them for me and just match up to the end of the opening tag (>).

I tried to use "." in the regex, much like with the re module in Python (re.compile(r'.')), where it matches spaces, symbols, and characters and just fills them in for me. I'm not sure whether that even exists in C#, but I hope so.

        WebClient web = new WebClient();
        String content = web.DownloadString(@"https://www.youtube.com/watch?v=hE73JvEc2pQ");

        MatchCollection matches = Regex.Matches(content, @"<yt-formatted-string\.>\s*(.+?)\s*</yt-formatted-string>", RegexOptions.Multiline);
        foreach (Match match in matches)
        {
            textComment.Text = $"\n{match.Groups[1].Value}";
        }

I got no matches.

I want the regex to fill in the attributes for me, like so:

HTML line:

<yt-formatted-string id="content-text" slot="content" split-lines="" class="style-scope ytd-comment-renderer">

Imaginary C# code that allows me to fill in the attributes:

"<yt-formatted-string(complete all the attributes here)>\s*(.+?)\s*</yt-formatted-string>"
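
A minimal sketch of what such a pattern could look like, assuming attribute values never contain a literal `>`. In .NET regex syntax `\.` matches a literal dot, whereas `[^>]*` consumes everything up to the end of the opening tag. The names `CommentRegexSketch` and `Extract` are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CommentRegexSketch
{
    // [^>]* skips any attributes up to the '>' that closes the opening tag.
    // Assumption: no attribute value contains a literal '>'.
    private static readonly Regex Pattern = new Regex(
        @"<yt-formatted-string\b[^>]*>\s*(.+?)\s*</yt-formatted-string>",
        RegexOptions.Singleline); // Singleline lets '.' match newlines too

    // Returns the trimmed inner text of every matching element.
    public static List<string> Extract(string html)
    {
        var results = new List<string>();
        foreach (Match match in Pattern.Matches(html))
        {
            results.Add(match.Groups[1].Value);
        }
        return results;
    }
}
```

`RegexOptions.Multiline` only changes how `^` and `$` behave; `RegexOptions.Singleline` is the option that lets `.` match newlines. Separately, the loop in the snippet above assigns `textComment.Text` with `=`, so only the last match would stay visible; `+=` would accumulate them.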
marc_s
Nizar K
  • For some reason the HTML elements (yt-formatted-string) keep getting deleted!! – Nizar K Feb 01 '19 at 17:22
  • I'm not sure what you mean by "complete an element", but I would recommend something like XPath rather than regex to extract data from HTML. Check this out: https://learn.microsoft.com/en-us/dotnet/standard/data/xml/select-nodes-using-xpath-navigation – IPValverde Feb 01 '19 at 17:25

2 Answers

You don't need to deal with such complicated parsing. Just use the YouTube Data API.

Check this API
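
As a rough sketch of that approach, assuming the Data API v3 `commentThreads.list` endpoint and an API key of your own: the helper below builds the request URL and pulls the comment text out of the JSON response. The names `CommentApiSketch`, `BuildUrl`, and `ExtractComments` are hypothetical, and `System.Text.Json` is assumed to be available (it ships with .NET Core 3.0 and later).

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

public static class CommentApiSketch
{
    // Builds a commentThreads.list request URL.
    // apiKey is a placeholder: you have to supply your own key.
    public static string BuildUrl(string videoId, string apiKey) =>
        "https://www.googleapis.com/youtube/v3/commentThreads"
        + "?part=snippet&textFormat=plainText"
        + "&videoId=" + Uri.EscapeDataString(videoId)
        + "&key=" + Uri.EscapeDataString(apiKey);

    // Reads items[].snippet.topLevelComment.snippet.textDisplay
    // from a commentThreads.list JSON response.
    public static List<string> ExtractComments(string json)
    {
        var comments = new List<string>();
        using (JsonDocument doc = JsonDocument.Parse(json))
        {
            foreach (JsonElement item in doc.RootElement.GetProperty("items").EnumerateArray())
            {
                string text = item.GetProperty("snippet")
                    .GetProperty("topLevelComment")
                    .GetProperty("snippet")
                    .GetProperty("textDisplay")
                    .GetString() ?? "";
                comments.Add(text);
            }
        }
        return comments;
    }
}
```

The response body could be fetched with something like `await new HttpClient().GetStringAsync(CommentApiSketch.BuildUrl("hE73JvEc2pQ", apiKey))` and then passed to `ExtractComments`; error handling and paging are left out of this sketch.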

Derviş Kayımbaşıoğlu
For cases where an API is not available, you should still avoid trying to parse HTML with a regex, and instead parse it as XML. See https://stackoverflow.com/a/1732454/6055952 for more information.
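
A minimal sketch of the XML route, under the strong assumption that the markup is well-formed XML. Real-world HTML often is not, in which case `XDocument.Parse` throws an `XmlException` and an HTML-tolerant parser would be needed instead. The class name `CommentXmlSketch` and the `id="content-text"` filter are taken from the question's example line for illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class CommentXmlSketch
{
    // Assumption: the input is well-formed XML; otherwise Parse throws XmlException.
    public static List<string> Extract(string wellFormedMarkup)
    {
        XDocument doc = XDocument.Parse(wellFormedMarkup);
        return doc.Descendants("yt-formatted-string")
            // Select by one attribute; any other attributes are simply ignored.
            .Where(e => (string)e.Attribute("id") == "content-text")
            .Select(e => e.Value.Trim())
            .ToList();
    }
}
```

Because the parser exposes attributes as structured data, the element can be selected by `id` without the pattern having to account for whatever other attributes are present.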

Matthew Varga