1

I need to parse HTML meta keywords using regex. The source string is always in the same format, like:

<meta name="description" content="description text">
<meta name="keywords" content="Keyword1, Keyword2, Keyword3...">
<link rel="alternate" type="application/xml+rss" href="http://example.com/rss">

I want to get Keyword1, Keyword2 and Keyword3 as a List<string>.

Ryan Kreager
Sergey Sypalo

3 Answers

2

Regex is not a good choice for parsing HTML.

HTML is neither strict nor regular in its format.

Use HtmlAgilityPack instead.

You can use this code to retrieve all the keywords with HtmlAgilityPack:

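// Requires the HtmlAgilityPack package, plus System.Linq and System.Collections.Generic.
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://yourWebSite.com");

// SelectSingleNode returns null when the page has no keywords meta tag.
HtmlNode node = doc.DocumentNode.SelectSingleNode("//meta[@name='keywords']");

List<string> keyLst = node == null
    ? new List<string>()
    : node.GetAttributeValue("content", "")
          .Split(',')
          .Select(k => k.Trim())            // drop the space after each comma
          .Where(k => k.Length > 0)
          .ToList();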

keyLst now contains all the keywords.

Anirudha
2

Description

If you're looking for a simple regex solution and your input isn't complex, you can try this expression, which pulls the value out of the content attribute:

<meta\b[^>]*\bname=["]keywords["][^>]*\bcontent=(['"]?)((?:[^,>"'],?){1,})\1[>]


Group 1 captures the opening quote, which must then be matched by the same quote at the end of the value. Group 2 captures the contents, which can then be split on commas.
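To get from group 2 to the List<string> the question asks for, something like the following might work. It is only a sketch: the class name KeywordExtractor and the helper Extract are made up for illustration, and it assumes the input has the same shape as the sample in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class KeywordExtractor
{
    // Same expression as above; "" is a literal quote inside a C# verbatim string.
    static readonly Regex MetaKeywords = new Regex(
        @"<meta\b[^>]*\bname=[""]keywords[""][^>]*\bcontent=(['""]?)((?:[^,>""'],?){1,})\1[>]",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns the trimmed keywords,
    // or an empty list when the tag is not found.
    public static List<string> Extract(string html)
    {
        Match m = MetaKeywords.Match(html);
        if (!m.Success)
            return new List<string>();

        return m.Groups[2].Value
            .Split(',')
            .Select(k => k.Trim())        // remove the space after each comma
            .Where(k => k.Length > 0)     // ignore empty entries
            .ToList();
    }

    static void Main()
    {
        string html = "<meta name=\"keywords\" content=\"Keyword1, Keyword2, Keyword3\">";
        // Prints: Keyword1|Keyword2|Keyword3
        Console.WriteLine(string.Join("|", Extract(html)));
    }
}
```

Returning an empty list on a failed match keeps callers from having to null-check, which seemed like the simpler choice here.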

Disclaimer

This expression can fail on some simple edge cases, which is why regex shouldn't be used for parsing HTML. You should look at using an HTML parsing engine instead.
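As a rough illustration of such edge cases, the sketch below runs the same expression against three variants of the tag. The class name RegexEdgeCases and the helper Matches are invented for this example. The expression requires name to come before content and requires double quotes around keywords, so the last two inputs should not match even though they are valid HTML.

```csharp
using System;
using System.Text.RegularExpressions;

static class RegexEdgeCases
{
    static readonly Regex MetaKeywords = new Regex(
        @"<meta\b[^>]*\bname=[""]keywords[""][^>]*\bcontent=(['""]?)((?:[^,>""'],?){1,})\1[>]",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: true when the expression finds a keywords tag.
    public static bool Matches(string html)
    {
        return MetaKeywords.IsMatch(html);
    }

    static void Main()
    {
        string[] inputs =
        {
            "<meta name=\"keywords\" content=\"a, b\">",  // the format from the question
            "<meta content=\"a, b\" name=\"keywords\">",  // content attribute comes first
            "<meta name='keywords' content='a, b'>",      // single quotes around the name value
        };

        foreach (string input in inputs)
            Console.WriteLine("{0} -> {1}", Matches(input), input);
    }
}
```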

C# Example

using System;
using System.Text.RegularExpressions;

namespace myapp
{
    class Class1
    {
        static void Main(string[] args)
        {
            // Placeholder: replace with the HTML to search.
            String sourcestring = "source string to match with pattern";
            Regex re = new Regex(@"<meta\b[^>]*\bname=[""]keywords[""][^>]*\bcontent=(['""]?)((?:[^,>""'],?){1,})\1[>]", RegexOptions.IgnoreCase);
            MatchCollection mc = re.Matches(sourcestring);
            int mIdx = 0;
            foreach (Match m in mc)
            {
                // Print every group of every match: 0 = whole tag, 1 = quote, 2 = content value.
                for (int gIdx = 0; gIdx < m.Groups.Count; gIdx++)
                {
                    Console.WriteLine("[{0}][{1}] = {2}", mIdx, re.GetGroupNames()[gIdx], m.Groups[gIdx].Value);
                }
                mIdx++;
            }
        }
    }
}

$matches Array:
(
    [0] => Array
        (
            [0] => <meta name="keywords" content="Keyword1, Keyword2, Keyword3...">
        )

    [1] => Array
        (
            [0] => "
        )

    [2] => Array
        (
            [0] => Keyword1, Keyword2, Keyword3...
        )

)
Ro Yo Mi
0

I wish I could comment instead of submitting this as an answer, but my rep is too low :(

I understand the need to use regex sometimes, but as everyone else recommends, it's preferable to use a standard XML or HTML parser. It copes better with unintended variations in the input and can even be faster.

See: https://stackoverflow.com/a/701177/1002098

delrocco