Detect particular tokens in a string. C#

Question

I have a very large string (HTML), and this HTML contains particular tokens, all of which start with "#" and end with "#".

A simple example:

<html>
<body>
      <p>Hi #Name#, You should come and see this #PLACE# - From #SenderName#</p>
</body>
</html>

I need code that will detect these tokens and put them in a list:

0 - #Name#
1 - #PLACE#
2 - #SenderName#

I know that I could probably use Regex, but do you have any ideas on how to do this?

score 12 · Answer 1 · edited May 23 '17 at 12:24

You can try:

// using System.Text.RegularExpressions;
// pattern = any number of arbitrary characters between #.
var pattern = @"#(.*?)#";
var matches = Regex.Matches(htmlString, pattern);

foreach (Match m in matches) {
    Console.WriteLine(m.Groups[1]);
}

Answer inspired by this SO question.

answered Nov 25 '10 at 13:33

Pablo Santa Cruz

+1 yes - considered using the non-greedy .* match too; although should it be .+? – Andras Zoltan Nov 25 '10 at 13:37
Will this fail to parse a text like this: `Hi #Name#where#PLACE# more text`, or have I misunderstood something regarding how RegEx works? It might not be a valid problem for OP either, so it's just for my own curiosity :) – Øyvind Bråthen Nov 25 '10 at 13:46
Yes. I think it will fail with `Hi #Name#where#PLACE# more text`. – Pablo Santa Cruz Nov 25 '10 at 13:48
See VladV's answer. It will actually work out fine. Then I learned something new today also :) – Øyvind Bråthen Nov 25 '10 at 14:27
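
On the `.*?` versus `.+?` question raised in the comments, here is a minimal sketch for comparing quantifiers on an input that contains an empty `##` pair. The class name `QuantifierDemo` and the helper `Captures` are hypothetical names introduced only for this illustration; the only library calls are `Regex.Matches` and standard LINQ.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any answer above.
public static class QuantifierDemo
{
    // Returns the text captured by group 1 for every match of the pattern.
    public static List<string> Captures(string input, string pattern)
    {
        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Groups[1].Value)
                    .ToList();
    }
}
```

For the input `## #Name#`, `Captures(input, @"#(.*?)#")` returns an empty string followed by `Name`, because `.*?` is allowed to match nothing. `#(.+?)#` avoids the empty token but instead captures `# ` (the second hash plus the space) and then misses `Name`. A stricter class such as `#(\w+)#` returns just `Name`, which is probably what is wanted if tokens are always single words.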

score 11 · Accepted Answer · answered Nov 25 '10 at 13:41

Yes, you can use regular expressions.

string test = "Hi #Name#, You should come and see this #PLACE# - From #SenderName#";
Regex reg = new Regex(@"#\w+#");
foreach (Match match in reg.Matches(test))
{
    Console.WriteLine(match.Value);
}

As you might have guessed, \w denotes any word character (letters, digits, and the underscore), and + means it must appear one or more times. You can find more info in the MSDN docs (for .NET 4; other versions are available there as well).
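
Since the question asks for the tokens in a list, the same pattern can be wrapped in a small helper. This is only a sketch: `TokenFinder` and `FindTokens` are hypothetical names, and it assumes tokens consist purely of word characters.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper built around the #\w+# pattern from this answer.
public static class TokenFinder
{
    // '#', then one or more word characters, then '#'.
    private static readonly Regex TokenPattern = new Regex(@"#\w+#");

    // Returns every token, hashes included, in order of appearance.
    public static List<string> FindTokens(string html)
    {
        var tokens = new List<string>();
        foreach (Match match in TokenPattern.Matches(html))
        {
            tokens.Add(match.Value);
        }
        return tokens;
    }
}
```

Calling `TokenFinder.FindTokens(test)` on the sample string gives `#Name#`, `#PLACE#`, `#SenderName#` at indexes 0, 1 and 2. Because \w does not match spaces or hyphens, something like `#Sender Name#` would not be picked up; the character class would need widening for that.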

score 4 · Answer 3 · answered Nov 25 '10 at 13:35

A variant without Regex if you like:

var splitstring = myHtmlString.Split('#');
var tokens = new List<string>();
for( int i = 1; i < splitstring.Length; i+=2){
  tokens.Add(splitstring[i]);
}

Øyvind Bråthen

Why a downvote on this? It will produce the required results. I would appreciate a reason from the downvoter. – Øyvind Bråthen Nov 25 '10 at 14:50
It works. I'll give it a +1 to make up for the person who loves regex too much. – tim Nov 25 '10 at 15:14

score 3 · Answer 4 · answered Nov 25 '10 at 13:33

foreach (Match m in Regex.Matches(input, @"#\w+#"))
    Console.WriteLine("'{0}' found at index {1}.",  m.Value, m.Index);

VladV

How will this parse `Hi #Name#where#PLACE# more text` correctly? Doesn't it also match words "outside" the hashes, as long as it's a single word? Or am I mistaken here? – Øyvind Bråthen Nov 25 '10 at 13:36
Just verified - on your example it gives "#Name#" and "#PLACE#". When multiple matches are considered, each of them starts after the previous one ends - that is, after "#Name#" is matched, it starts looking for a next match after the second hash sign. – VladV Nov 25 '10 at 13:50
+1: That is perfect. I see why now: the # is actually "used" by the first match, and therefore cannot be used by the second as well. Thanks for the enlightenment. – Øyvind Bråthen Nov 25 '10 at 14:28
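
The behaviour described in these comments can be checked with a short sketch that records where each match starts. `MatchPositions` and `Describe` are hypothetical names used only for this illustration.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper that reports each match together with its start index.
public static class MatchPositions
{
    public static List<string> Describe(string input)
    {
        var found = new List<string>();
        // Regex.Matches resumes scanning after the end of the previous match,
        // so a '#' consumed by one match cannot open the next one.
        foreach (Match m in Regex.Matches(input, @"#\w+#"))
        {
            found.Add(m.Value + "@" + m.Index);
        }
        return found;
    }
}
```

For `Hi #Name#where#PLACE# more text` this returns `#Name#@3` and `#PLACE#@14`. The candidate `#where#` would have to start at index 8, but that hash already belongs to the first match, so it is never considered.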

score 3 · Answer 5 · edited Nov 25 '10 at 15:30

Try this:

var result = html.Split('#')
                    .Select((s, i) => new {s, i})
                    .Where(p => p.i%2 == 1)
                    .Select(t => t.s);

Explanation:

line 1 - we split the text on the character '#'

line 2 - we project each string into an anonymous type that holds the string itself and its position in the array

line 3 - we keep only the anonymous objects with an odd index, effectively picking every other string; these are the strings that were wrapped in hash characters, rather than those outside

line 4 - we strip away the index and return just the string from the anonymous type
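
The four steps above can be sketched with more descriptive names. `SplitTokens`, `Extract`, `Segment` and `Position` are hypothetical names chosen for readability, and the approach assumes every '#' in the input belongs to a balanced token pair.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical rewrite of the Split/Select/Where pipeline with explicit names.
public static class SplitTokens
{
    public static List<string> Extract(string html)
    {
        return html.Split('#')                                  // step 1: cut on every '#'
                   .Select((segment, position) =>
                       new { Segment = segment, Position = position }) // step 2: pair each piece with its index
                   .Where(entry => entry.Position % 2 == 1)     // step 3: odd indexes lie between two hashes
                   .Select(entry => entry.Segment)              // step 4: drop the index again
                   .ToList();
    }
}
```

`SplitTokens.Extract("Hi #Name#, see #PLACE#")` returns `Name` and `PLACE`. A single stray hash shifts the odd/even pairing: `Extract("Price #1 for #Name#")` returns `1 for ` and an empty string instead of `Name`, so this variant is best kept for input where hashes only ever appear around tokens.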

edited Nov 25 '10 at 15:30

answered Nov 25 '10 at 13:39

Dean Chalk

+1 for using the `Select` overload that gives you the index in addition to the value that I think all are aware of. – Øyvind Bråthen Nov 25 '10 at 14:31
Nice and short, but would you mind explaining it a bit further? s,i, p? perhaps use "explaining" variables would make it more educational for others. – BerggreenDK Nov 25 '10 at 15:21

score 2 · Answer 6 · edited Nov 25 '10 at 13:34

Use:

MatchCollection matches = Regex.Matches(mytext, @"#(\w+)#");

foreach(Match m in matches)
{
    Console.WriteLine(m.Groups[1].Value);
}

edited Nov 25 '10 at 13:34

answered Nov 25 '10 at 13:32

Aliostad

score 2 · Answer 7 · answered Nov 25 '10 at 13:37

Naive solution:

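// Captures the text between two '#' characters (group 1 excludes the hashes).
// Inside a character class '.' is literal, so [^\#.] excludes dots as well as '#'.
var result = Regex
    .Matches(html, @"\#([^\#.]*)\#")
    .OfType<Match>()
    .Select(x => x.Groups[1].Value)
    .ToList();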

Darin Dimitrov

score 1 · Answer 8 · answered Nov 25 '10 at 13:41

Linq solution:

        string s = @"<p>Hi #Name#, 
          You should come and see this #PLACE# - From #SenderName#</p>";

        var result = s.Split('#').Where((x, y) => y % 2 != 0).Select(x => x);

nan

Nice and short, but would you mind explaining it a bit further? x,y? perhaps use "explaining" variables would make it more educational for others. – BerggreenDK Nov 25 '10 at 15:20
@BerggreenDK Of course. The method `Where` is overloaded: in `(x, y)`, `x` is the current item of the collection and `y` is the index of that item. Yes, you're right, I could have used `Where((item, index) => ...)` for better readability. I then keep only the strings at odd indexes, because those are the ones we need. – nan Nov 25 '10 at 18:00

score 0 · Answer 9 · answered Nov 25 '10 at 13:36

Use the Regex.Matches method with a pattern something like

#[^#]+#

which is possibly the most naive way.

This might then need to be adjusted if you wish to avoid including the '#' characters in the output match, possibly with a lookaround:

(?<=#)[^#]+(?=#)

(A match value for this would be 'hello' not '#hello#' - so you don't have to do any more trimming)
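
One caveat with the lookaround form is worth checking before relying on it: lookarounds do not consume the '#' characters, so the text between two tokens is also preceded and followed by a hash and gets matched as well. The sketch below compares it with a capture group; `LookaroundDemo` and its method names are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical comparison of the lookaround pattern with a capture group.
public static class LookaroundDemo
{
    // Lookarounds leave the hashes unconsumed, so they can be reused by the next match.
    public static List<string> WithLookaround(string input)
    {
        return Values(input, @"(?<=#)[^#]+(?=#)", 0);
    }

    // The hashes are part of the match, so each one is used only once.
    public static List<string> WithCaptureGroup(string input)
    {
        return Values(input, @"#([^#]+)#", 1);
    }

    private static List<string> Values(string input, string pattern, int group)
    {
        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Groups[group].Value)
                    .ToList();
    }
}
```

For `Hi #Name#, see #PLACE#!` the lookaround version returns `Name`, `, see ` and `PLACE`, while the capture-group version returns only `Name` and `PLACE`. If the hashes should be left out of the result, reading `Groups[1]` is the safer of the two.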

score 0 · Answer 10 · answered Nov 25 '10 at 13:37

This gives you a list of the tokens as requested:

var tokens = new List<string>();
var matches = new Regex("(#.*?#)").Matches(html);

foreach (Match m in matches) 
    tokens.Add(m.Groups[1].Value);

Edit: If you don't want the pound characters included, just move them outside the parentheses in the Regex string (see Pablo's answer).
