Correctly use regular expressions to extract word

Question

I've got an ASP.NET Core project that requires me to read the response from a website and extract a certain word.

I tried replacing the tags with whitespace and then stripping them out, but I'm not getting anywhere with this. What is a better approach?

I want to extract Toyota from this HTML:

<tr>
<td class="text-muted">Car Model</td>
<td><strong>Toyota 2015</strong></td>
</tr>

I've tried:

var documentSource = streamReader.ReadToEnd();
//removes html content
Regex remove = new Regex(@"<[^>].+?>");
var strippedSource = remove.Replace(documentSource.Replace("\n", ""), "");
//convert to array
string[] siteContextArray = strippedSource.Split(',');
//matching string
var match = new Regex("Car Model ([^2015]*)");

List<Model> modelList = new List<Model>();
Model model = new Model();

foreach (var item in siteContextArray)
{
    var wordMatch = match.Match(item);
    if (wordMatch.Success)
    {
        modelList.Add(
            new Model
            {
                CarModel = wordMatch.Groups[1].Value
            }
        );
    }
}
return modelList;

Please don't use regex to parse HTML, use an HTML parser instead. — Tim Biegeleisen, Aug 27 '19 at 01:26
Hi @TimBiegeleisen I'm glad you've mentioned this. I've never heard of an HTML parser. How would I approach this? — Jenny From the Block, Aug 27 '19 at 01:34
Definitive answer regarding parsing with regex: https://stackoverflow.com/a/1732454/4665 — Jon P, Aug 27 '19 at 02:05
The go-to HTML parser for .NET is [HTML Agility Pack](http://html-agility-pack.net/?z=codeplex) — Jon P, Aug 27 '19 at 02:08

score 0 · Accepted Answer · answered Aug 27 '19 at 02:24

Use NuGet to add the HTML Agility Pack package to your project.

Usage

var html = @"
<tr>
    <td class=""text-muted"">Car Model</td>
    <td><strong> Toyota 2015 </strong></td>
</tr>
<tr>
    <td class=""text-muted"">Car Model</td>
    <td><strong> Toyota 2016 </strong></td>
</tr>";

// Requires: using System.Linq; and using HtmlAgilityPack;
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
var models = htmlDoc.DocumentNode
    .SelectNodes("//tr/td[text()='Car Model']")
    .Select(node => node.SelectSingleNode("following-sibling::*[1][self::td]").InnerText);
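One possible reason the snippet above misbehaves on a real page: `SelectNodes` returns null when nothing matches, and `text()='Car Model'` needs an exact match, so stray whitespace around the label breaks it. Below is a minimal, more defensive sketch. The names `CarModelExtractor` and `ExtractMakes`, and the rule that the make is the first whitespace-separated token of the cell text, are assumptions for illustration and not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class CarModelExtractor
{
    // Hypothetical helper: returns the make ("Toyota") for every "Car Model" row.
    public static List<string> ExtractMakes(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // normalize-space() tolerates leading/trailing whitespace around the label.
        var labels = doc.DocumentNode
            .SelectNodes("//tr/td[normalize-space(text())='Car Model']");

        // SelectNodes returns null, not an empty collection, when nothing matches.
        if (labels == null)
        {
            return new List<string>();
        }

        var separators = new[] { ' ', '\t', '\r', '\n' };

        return labels
            // The value cell is the first <td> after the label cell.
            .Select(label => label.SelectSingleNode("following-sibling::td[1]"))
            .Where(cell => cell != null)
            // Decode entities such as &nbsp; and drop surrounding whitespace.
            .Select(cell => HtmlEntity.DeEntitize(cell.InnerText).Trim())
            // Assumption: "Toyota 2015" -> "Toyota" (first token is the make).
            .Select(text => text
                .Split(separators, StringSplitOptions.RemoveEmptyEntries)
                .FirstOrDefault())
            .Where(make => !string.IsNullOrEmpty(make))
            .ToList();
    }
}
```

Calling `CarModelExtractor.ExtractMakes(html)` with the two-row sample above should give one `Toyota` entry per row. If the make can contain spaces, the first-token rule would need to be replaced with something that strips the trailing year instead.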

By the way, if you control the markup, it would help to add a CSS class to the content element, like this:

<td class="car-model"><strong> Toyota 2016 </strong></td>

That would make the HTML more meaningful and the value easier to extract.
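With such a class in place, the lookup no longer needs the label cell at all. The following is a rough sketch, assuming the `class` attribute is exactly `car-model` (an element carrying several classes would need a looser test). `CarModelByClass` and `Extract` are hypothetical names used only for this example.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class CarModelByClass
{
    // Hypothetical helper: returns the trimmed text of every <td class="car-model">.
    public static List<string> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // @class='car-model' compares the whole attribute value, so this
        // assumes the cell has no additional classes.
        var cells = doc.DocumentNode.SelectNodes("//td[@class='car-model']");

        // SelectNodes returns null when there are no matches.
        if (cells == null)
        {
            return new List<string>();
        }

        return cells
            .Select(cell => cell.InnerText.Trim())
            .ToList();
    }
}
```

For the example cell above, `Extract` should return the single string `Toyota 2016`.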

Thank you so much for helping me. I tried your code, but it didn't work as expected. Do I need to specify the class this way? `//tr/td[text-muted='Car Model']` — Jenny From the Block, Aug 27 '19 at 15:14
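On the XPath in the comment above: in XPath, a bare name inside a predicate refers to a child element, so `td[text-muted='Car Model']` looks for a `<text-muted>` child rather than a class. Attributes are tested with `@`, as in `@class='text-muted'`. A small sketch of combining the class test with the label test follows. `LabelledCellFinder` and `CarModelValues` are hypothetical names, and the code assumes the label cell's `class` attribute is exactly `text-muted`.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class LabelledCellFinder
{
    // Label cell: a <td> whose class attribute is "text-muted" and whose
    // text, ignoring surrounding whitespace, is "Car Model".
    // The trailing step moves to the first <td> after that label cell.
    private const string ValueCellXPath =
        "//tr/td[@class='text-muted' and normalize-space(text())='Car Model']" +
        "/following-sibling::td[1]";

    // Hypothetical helper: returns the trimmed text of each matching value cell.
    public static List<string> CarModelValues(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var cells = doc.DocumentNode.SelectNodes(ValueCellXPath);

        // SelectNodes returns null when there are no matches.
        if (cells == null)
        {
            return new List<string>();
        }

        return cells
            .Select(cell => cell.InnerText.Trim())
            .ToList();
    }
}
```

If this still returns nothing for the live page, it may be worth checking the downloaded HTML itself, since the label text or class on the real site could differ from the sample in the question.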
