13

I have a very large string (HTML) that contains particular tokens, all of which start with "#" and end with "#".

A simple example:

<html>
<body>
      <p>Hi #Name#, You should come and see this #PLACE# - From #SenderName#</p>
</body>
</html>

I need code that will detect these tokens and put them in a list: 0 - #Name#, 1 - #PLACE#, 2 - #SenderName#

I know I can probably use Regex, but do you have any ideas on how to do that?

David Bonnici

10 Answers

12

You can try:

// using System.Text.RegularExpressions;
// pattern = any number of arbitrary characters between #.
var pattern = @"#(.*?)#";
var matches = Regex.Matches(htmlString, pattern);

foreach (Match m in matches) {
    Console.WriteLine(m.Groups[1]);
}

Answer inspired by this SO question.

Community
Pablo Santa Cruz
11

Yes you can use regular expressions.

string test = "Hi #Name#, You should come and see this #PLACE# - From #SenderName#";
Regex reg = new Regex(@"#\w+#");
foreach (Match match in reg.Matches(test))
{
    Console.WriteLine(match.Value);
}

As you might have guessed, \w denotes any word character (letters, digits, and the underscore). The + means that it must appear one or more times. You can find more info in the MSDN documentation (for .NET 4; other versions are available there as well).
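Since the question asks for the tokens in a list, the loop above could be adapted along the lines of the following sketch. The class and method names (`TokenFinder`, `FindTokens`) are hypothetical and exist only for illustration; the pattern is the same `#\w+#` used above.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TokenFinder
{
    // Hypothetical helper: returns every #token# found in the input,
    // hashes included, in the order the tokens appear.
    public static List<string> FindTokens(string input)
    {
        var tokens = new List<string>();
        foreach (Match match in Regex.Matches(input, @"#\w+#"))
        {
            tokens.Add(match.Value);
        }
        return tokens;
    }
}
```

Calling `TokenFinder.FindTokens(test)` with the `test` string from this answer should yield `#Name#`, `#PLACE#` and `#SenderName#` at indices 0, 1 and 2.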

John Sloper
4

A variant without Regex if you like:

var splitstring = myHtmlString.Split('#');
var tokens = new List<string>();
for (int i = 1; i < splitstring.Length; i += 2) {
  tokens.Add(splitstring[i]);
}
Øyvind Bråthen
3
foreach (Match m in Regex.Matches(input, @"#\w+#"))
    Console.WriteLine("'{0}' found at index {1}.",  m.Value, m.Index);
VladV
  • How will this parse `Hi #Name#where#PLACE# more text` correctly? Doesn't this also match words "outside" the hashes, as long as it's a single word? Or am I mistaken here? – Øyvind Bråthen Nov 25 '10 at 13:36
  • Just verified - on your example it gives "#Name#" and "#PLACE#". When multiple matches are considered, each of them starts after the previous one ends - that is, after "#Name#" is matched, it starts looking for the next match after the second hash sign. – VladV Nov 25 '10 at 13:50
  • +1: That is perfect. I see why now, since the # is actually "used" by the first match, and therefore cannot be used by the second also. Thanks for the enlightenment. – Øyvind Bråthen Nov 25 '10 at 14:28
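The behaviour discussed in these comments can be checked with a small sketch like the one below. The class and method names (`MatchPositions`, `Describe`) are hypothetical; the pattern and the sample input come from the answer and the first comment.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MatchPositions
{
    // Hypothetical helper: lists each match as "value@index" so it is easy
    // to see where every match starts.
    public static List<string> Describe(string input)
    {
        var found = new List<string>();
        foreach (Match m in Regex.Matches(input, @"#\w+#"))
        {
            found.Add(m.Value + "@" + m.Index);
        }
        return found;
    }
}
```

For `Hi #Name#where#PLACE# more text` this should report `#Name#@3` and `#PLACE#@14`. `#where#` is not reported because the hash in front of it has already been consumed by the first match.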
3

Try this:

var result = html.Split('#')
                    .Select((s, i) => new {s, i})
                    .Where(p => p.i%2 == 1)
                    .Select(t => t.s);

Explanation:

line1 - we split the text on the character '#'

line2 - we project each string into a new anonymous type, which holds the string's position in the array and the string itself

line3 - we filter the list of anonymous objects down to those with an odd index value, effectively picking 'every other' string. This finds the strings that were wrapped in hash characters rather than those outside them

line4 - we strip away the index and return just the string from the anonymous type
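The four steps can also be written with more descriptive names, as in this sketch. The names `SplitTokens`, `Extract`, `segment`, `index` and `pair` are hypothetical, and the approach assumes the hashes in the input are balanced; with a stray '#' the odd-indexed pieces no longer line up with the tokens.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class SplitTokens
{
    // Hypothetical helper: returns the text between each pair of '#'
    // characters, assuming every opening '#' has a matching closing '#'.
    public static List<string> Extract(string html)
    {
        return html.Split('#')
                   // pair each piece of text with its position in the array
                   .Select((segment, index) => new { segment, index })
                   // pieces at odd positions sat between two hashes
                   .Where(pair => pair.index % 2 == 1)
                   // drop the position and keep only the text
                   .Select(pair => pair.segment)
                   .ToList();
    }
}
```

Unlike the regex answers that return `match.Value`, this returns the token names without the surrounding hashes, for example `Name` rather than `#Name#`.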

Dean Chalk
  • +1 for using the `Select` overload that gives you the index in addition to the value that I think all are aware of. – Øyvind Bråthen Nov 25 '10 at 14:31
  • Nice and short, but would you mind explaining it a bit further? s, i, p? Perhaps using "explaining" variable names would make it more educational for others. – BerggreenDK Nov 25 '10 at 15:21
2

Use:

MatchCollection matches = Regex.Matches(mytext, @"#(\w+)#");

foreach(Match m in matches)
{
    Console.WriteLine(m.Groups[1].Value);
}
Aliostad
2

Naive solution:

var result = Regex
    .Matches(html, @"\#([^\#.]*)\#")
    .OfType<Match>()
    .Select(x => x.Groups[1].Value)
    .ToList();
Darin Dimitrov
1

Linq solution:

        string s = @"<p>Hi #Name#, 
          You should come and see this #PLACE# - From #SenderName#</p>";

        var result = s.Split('#').Where((x, y) => y % 2 != 0).Select(x => x);
nan
  • Nice and short, but would you mind explaining it a bit further? x, y? Perhaps using "explaining" variable names would make it more educational for others. – BerggreenDK Nov 25 '10 at 15:20
  • @BerggreenDK Of course, the method `Where` is overloaded. `(x,y)` is a pair, where `x` is the current item of the collection and `y` is the index of this item. Yes, you're right, I could have used `Where((item, index) => ...)` for better readability. Then I keep only the strings at odd indices, because those are the ones we need. – nan Nov 25 '10 at 18:00
0

Use the Regex.Matches method with a pattern something like

#[^#]+#

which is possibly the most naive way.

This might then need to be adjusted if you wish to avoid including the '#' characters in the output match, possibly with a lookaround:

(?<=#)[^#]+(?=#)

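(A match value for this would be 'hello' not '#hello#' - so you don't have to do any more trimming. Note, however, that lookarounds do not consume the '#' characters, so when the input contains several tokens this pattern also matches the text between two tokens; a capture group such as #([^#]+)# avoids that.)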
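To compare the lookaround pattern with a capture-group alternative, a sketch like the following may help. The class and method names (`HashPatterns`, `WithLookaround`, `WithGroup`) are hypothetical, and `#([^#]+)#` is a variation on the pattern above rather than something this answer proposes.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HashPatterns
{
    // Lookaround version: the hashes are asserted but not consumed,
    // so a closing '#' can also act as the opening '#' of the next match.
    public static List<string> WithLookaround(string input)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(input, @"(?<=#)[^#]+(?=#)"))
        {
            result.Add(m.Value);
        }
        return result;
    }

    // Capture-group version: the hashes are consumed by the match and
    // group 1 holds the text between them.
    public static List<string> WithGroup(string input)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(input, @"#([^#]+)#"))
        {
            result.Add(m.Groups[1].Value);
        }
        return result;
    }
}
```

On the input `a #x# b #y# c`, the lookaround version should return `x`, ` b ` and `y`, while the capture-group version returns only `x` and `y`.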

Andras Zoltan
0

This gives you a list of the tokens as requested:

var tokens = new List<string>();
var matches = new Regex("(#.*?#)").Matches(html);

foreach (Match m in matches) 
    tokens.Add(m.Groups[1].Value);

Edit: If you don't want the pound characters included, just move them outside the parentheses in the Regex string (see Pablo's answer).

Mark Bell