Need regular expression to find all phrases in html

Question

I am parsing HTML (held as a string in C# code) and need to extract all text phrases from it. For example, given this HTML:

<div><div>text1</div>text2</div>

I want to get an array of strings:

text1
text2

If a regular expression cannot do this, please describe an algorithm that skips all tag names and tag attributes and returns only the text content.

Update: this is not a duplicate of the span problem, because the text can be in any tag, not only span. I need all text, except tags and attributes. I don't want to use the HtmlAgilityPack parser.

Update 2: found a regex (yes, it is possible)

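    // Parse the HTML and save each text node in the list.
    // Needs: using System.Collections.Generic; using System.Text.RegularExpressions;
    public void FindTextHtml(string html, List<string> list)
    {
        // Capture whatever sits between a '>' and the next '<'.
        // (IgnoreCase and Multiline have no effect on this pattern, so they are omitted.)
        var ms = Regex.Matches(html, @">([^<>]*)<");
        foreach (Match m in ms)
        {
            var text = m.Groups[1].Value;
            // Adjacent tags such as "<div><div>" yield empty captures; skip them.
            if (!string.IsNullOrWhiteSpace(text))
            {
                list.Add(text);
            }
        }
    }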

Full source code available here
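
For the regex-free algorithm asked about above, one possible sketch that uses only built-in .NET types is a character scan that tracks whether the current position is inside a tag. The names `HtmlTextScanner`, `ExtractText` and `Flush` are hypothetical and not from the original post. The scan assumes simple markup: it does not handle `>` inside attribute values, comments, or `<script>`/`<style>` content, and it does not decode entities such as `&amp;`.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration; not part of the original post.
public static class HtmlTextScanner
{
    // Returns the trimmed, non-empty runs of text found outside of tags.
    public static List<string> ExtractText(string html)
    {
        var result = new List<string>();
        var current = new StringBuilder();
        bool insideTag = false;

        foreach (char c in html)
        {
            if (c == '<')
            {
                // A tag starts here: store the text collected so far.
                Flush(current, result);
                insideTag = true;
            }
            else if (c == '>' && insideTag)
            {
                // The tag ends; following characters are text again.
                insideTag = false;
            }
            else if (!insideTag)
            {
                current.Append(c);
            }
            // Characters inside a tag (names, attributes) are ignored.
        }

        // Store any text that follows the last tag.
        Flush(current, result);
        return result;
    }

    // Adds the buffered text to the result if it is not blank, then resets the buffer.
    private static void Flush(StringBuilder current, List<string> result)
    {
        string text = current.ToString().Trim();
        if (text.Length > 0)
        {
            result.Add(text);
        }
        current.Clear();
    }
}
```

Calling `HtmlTextScanner.ExtractText("<div><div>text1</div>text2</div>")` walks the sample from the question and collects `text1` and `text2` as separate entries. For anything beyond simple markup, the commenters' advice to use a real HTML parser still applies.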

Sounds like a [XY problem](http://meta.stackexchange.com/q/66377/158761)? — Soner Gönül, Feb 11 '16 at 11:04
You're trying to hammer in a nail using a screwdriver. If you need to parse HTML, use an HTML parser. — JJJ, Feb 11 '16 at 11:06
I think it is a very simple problem and most developers know how to solve it. I can solve it myself, but want to save time. Thanks for understanding. — Alexey Obukhov, Feb 11 '16 at 11:07
See the answer below: no parsers needed. Just a regular expression. — Alexey Obukhov, Feb 11 '16 at 11:17
You can use the solution in the original question and to get the array, just split with `\r\n` and trim all the array elements. You should not use regular expressions for this task. — Wiktor Stribiżew, Feb 11 '16 at 11:18

score 2 · Answer 1 · answered Feb 11 '16 at 11:08

What you are looking for is here: Grabbing HTML Tags

The matches you are looking for would be in the ...(.*?)... group. Hope this helps.

Kasper Jensen

ako · Answer 2 · 2016-02-11T11:59:38.550

Use the HtmlAgilityPack DLL, which parses XML and HTML files, and then use the code below to get your text:

    string path = @"path to the file";
    HtmlAgilityPack.HtmlDocument hd = new HtmlAgilityPack.HtmlDocument();
    hd.Load(path);
    // InnerText concatenates the text of all descendant nodes into one string.
    string result = hd.DocumentNode.InnerText.Trim();

That is all you need.
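
Since the question asks for an array of phrases rather than one concatenated string, a possible variant is to select the individual text nodes. This is only a sketch: the class and method names (`HapTextNodes`, `ExtractText`) are hypothetical, it assumes the HtmlAgilityPack package is referenced, and it loads the HTML from a string with `LoadHtml` instead of from a file.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper for illustration; requires a reference to HtmlAgilityPack.
public static class HapTextNodes
{
    // Returns each non-blank text node as a separate, entity-decoded string.
    public static List<string> ExtractText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string>();

        // "//text()" selects every text node; SelectNodes returns null when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//text()");
        if (nodes == null)
        {
            return result;
        }

        foreach (HtmlNode node in nodes)
        {
            // DeEntitize turns entities such as &amp; back into characters.
            string text = HtmlEntity.DeEntitize(node.InnerText).Trim();
            if (text.Length > 0)
            {
                result.Add(text);
            }
        }
        return result;
    }
}
```

For the sample in the question, `HapTextNodes.ExtractText("<div><div>text1</div>text2</div>")` should yield `text1` and `text2` as separate entries, whereas `InnerText` on the document node joins them into a single string.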

edited Feb 11 '16 at 11:59

answered Feb 11 '16 at 11:15

Good approach, I will use it if I cannot find a regular expression. Don't want to add additional libraries to my project. – Alexey Obukhov Feb 11 '16 at 11:19
Just add the HtmlAgilityPack dll to your references – ako Feb 11 '16 at 11:21
Check my answer in the original question. Perhaps, `return HtmlAgilityPack.HtmlEntity.DeEntitize(doc.DocumentNode.InnerText.Trim());` is better for the current scenario. – Wiktor Stribiżew Feb 11 '16 at 11:24
Don't want to use HtmlAgilityPack. Only built-in .NET libraries. – Alexey Obukhov Feb 11 '16 at 11:28
@WiktorStribiżew your code gives exactly the same output as mine (the code above) – ako Feb 11 '16 at 11:29
@ako: Only with the given input. – Wiktor Stribiżew Feb 11 '16 at 11:32
