5

I am hopeless with regex (C#), so I would appreciate some help:

Basically, I need to parse a text and find the following information inside it:

Sample text:

KeywordB:***TextToFind* the rest is not relevant but **KeywordB: Text ToFindB and then some more text.

I need to find the word(s) after a certain keyword which may end with a “:”.

[UPDATE]

Thanks Andrew and Alan. Sorry for reopening the question, but there is quite an important thing missing in that regex. As I wrote in my last comment: is it possible to have a variable (how many words to look for, depending on the keyword) as part of the regex?

Or: I could have a different regex for each keyword (there will only be a handful). But I still don't know how to put the "words to look for" count inside the regex.

Flo
  • Regular expression syntax is slightly different if you're using a Linux oriented technology or a Microsoft oriented technology so you might want to tag which one you're working with. – Spencer Ruport Jan 18 '09 at 01:03

3 Answers

6

The basic regex is this:

var pattern = @"KeywordB:\s*(\w*)";
    \s* = any amount of whitespace (including none)
    \w* = 0 or more word characters (letters, digits, and underscore)
    ()  = make a group, so you can extract the part that matched

var pattern = @"KeywordB:\s*(\w*)";
var test = @"KeywordB: TextToFind";
var match = Regex.Match(test, pattern);
if (match.Success) {
    Console.Write("Value found = {0}", match.Groups[1]);
}

If you have more than one of these on a line, you can use this:

var test = @"KeywordB: TextToFind KeyWordF: MoreText";
var matches = Regex.Matches(test, @"(?:\s*(?<key>\w*):\s?(?<value>\w*))");
foreach (Match f in matches ) {
    Console.WriteLine("Keyword '{0}' = '{1}'", f.Groups["key"], f.Groups["value"]);
}

Also, check out the regex designer here: http://www.radsoftware.com.au/. It is free, and I use it constantly. It works great to prototype expressions. You need to rearrange the UI for basic work, but after that it's easy.

(fyi) The "@" before a string makes it a verbatim string literal, so \ is no longer an escape character: you can type @"c:\fun.txt" instead of "c:\\fun.txt".
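For the case where a value can be more than one word, here is a minimal sketch of one way to put a per-keyword word count into the pattern. It is not part of the original answer: the helper name FindWords, the class name, and the sample keywords "KeywordA"/"KeywordB" are assumptions made for illustration. It uses \w+ so that at least one character is required, and builds a {0,n-1} quantifier from the requested count.

```csharp
using System;
using System.Text.RegularExpressions;

public static class KeywordWords
{
    // Hypothetical helper: returns up to `wordCount` whitespace-separated
    // words that follow "keyword:", or null when the keyword is not found.
    public static string FindWords(string text, string keyword, int wordCount)
    {
        if (wordCount < 1) throw new ArgumentOutOfRangeException(nameof(wordCount));

        // \w+ needs at least one word character; (?:\s+\w+){0,n-1} allows
        // up to n-1 further words, each preceded by whitespace.
        string pattern = Regex.Escape(keyword)
                       + @":\s*(\w+(?:\s+\w+){0," + (wordCount - 1) + "})";

        Match match = Regex.Match(text, pattern);
        return match.Success ? match.Groups[1].Value : null;
    }

    public static void Main()
    {
        string text = "KeywordA: TextToFind the rest KeywordB: Text ToFindB and more";
        Console.WriteLine(FindWords(text, "KeywordA", 1)); // TextToFind
        Console.WriteLine(FindWords(text, "KeywordB", 2)); // Text ToFindB
    }
}
```

The word count lives outside the regex (for example in a small keyword-to-count table) and is only spliced into the quantifier when the pattern is built. The pattern itself cannot tell where a value ends, so it simply takes as many words as it is asked for, up to the limit.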

Andrew
  • Just one more thing: in some cases the value can be two words rather than one. Any suggestions? – Flo Jan 18 '09 at 01:51
  • How is the regex supposed to know it should match two words instead of one? – Alan Moore Jan 18 '09 at 02:47
  • @Andrew, do you realize almost everything in that regex is optional? It could legally match just a colon. You should replace `\w*` with `\w+`. Also, I don't see any need to enclose the whole thing in parens, nor for that `\s*` at the beginning. – Alan Moore Jan 18 '09 at 02:56
  • @Alan So there is no way to tell the regex to get not just the first word but also the second, when the two are separated by a space? – Flo Jan 18 '09 at 16:36
  • Yes, it could be more complete, more robust, etc, but this wasn't quite production code :) I'll update it. Also, the only real good way to match more than one word is to ensure that the ":" is right after the keyword – Andrew Jan 19 '09 at 03:39
  • @Flo, if the string is "KeywordB: word1 word2 more text", how can you know if the regex is supposed to match "word2"? Is there something about the second word that distinguishes it from the following text? – Alan Moore Jan 20 '09 at 12:07
  • Alan, is it possible to have a variable (how many words to look for, depending on the keyword) as part of the regex? Or: I could have a different regex for each keyword (there will only be a handful). But I still don't know how to put the "words to look for" count inside the regex. – Flo Jan 20 '09 at 20:46
  • Damn, this is WAY harder than I thought... really. I've done something like this before with a regex. You might be better off just looking for something like the "key:" pattern, and then extracting everything after that and before the next occurrence. – Andrew Jan 20 '09 at 22:45
5

Let me know if I should delete the old post, but perhaps someone wants to read it.

The way to do a "words to look for" inside the regex is like this:

regex = @"(Key1|Key2|Key3|LastName|FirstName|Etc):"

What you are doing probably isn't worth the effort in a regex, though it can probably be done the way you want (still not 100% clear on requirements, though). It involves looking ahead to the next match, and stopping at that point.
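As a rough illustration of the "look ahead to the next match and stop there" idea, the sketch below does the whole job in a single pattern. It is an assumption about how such a regex could look, not the approach this answer settles on; the class name, the Parse helper, and the sample keys are made up for the example. A lazy `.*?` captures the value, and a lookahead ends it at the next "key:" or at the end of the input.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class LookaheadKeys
{
    // Hypothetical helper: returns (key, value) pairs in the order found.
    public static List<KeyValuePair<string, string>> Parse(
        IEnumerable<string> keywords, string source)
    {
        // Escape each keyword so it is matched literally, then join with "|".
        string keys = string.Join("|", keywords.Select(k => Regex.Escape(k)));

        // The value is matched lazily and stops right before the next
        // "key:" (checked by the lookahead, not consumed) or at the end.
        string pattern = @"(?<key>" + keys + @"):(?<value>.*?)(?=(?:" + keys + @"):|$)";

        var result = new List<KeyValuePair<string, string>>();
        var options = RegexOptions.IgnoreCase | RegexOptions.Singleline;
        foreach (Match m in Regex.Matches(source, pattern, options))
        {
            result.Add(new KeyValuePair<string, string>(
                m.Groups["key"].Value, m.Groups["value"].Value));
        }
        return result;
    }

    public static void Main()
    {
        string[] keys = { "Key1", "Key2", "Key3" };
        string source = "Key1:Value1Key2: ValueAnd A: To Test Key3:   Something";
        foreach (var pair in Parse(keys, source))
        {
            Console.WriteLine("Key={0}, Value={1}", pair.Key, pair.Value);
        }
    }
}
```

Because only the listed keys end a value, the "A:" inside the second value is treated as ordinary text. Values are returned untrimmed, so leading or trailing spaces are kept.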

Here is a re-write as a regex + regular functional code that should do the trick. It doesn't care about spaces, so if you ask for "Key2" like below, it will separate it from the value.

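// Requires: using System; using System.Collections.Generic;
//           using System.Linq; using System.Text.RegularExpressions;
string[] keys = {"Key1", "Key2", "Key3"};
string source = "Key1:Value1Key2: ValueAnd A: To Test Key3:   Something";
FindKeys(keys, source);

private void FindKeys(IEnumerable<string> keywords, string source) {
    var found = new Dictionary<string, string>(10);
    // Build an alternation such as "Key1|Key2|Key3"; each keyword is escaped
    // so that any regex metacharacters in it are matched literally.
    var keys = string.Join("|", keywords.Select(k => Regex.Escape(k)).ToArray());
    var matches = Regex.Matches(source, @"(?<key>" + keys + "):",
                          RegexOptions.IgnoreCase);

    foreach (Match m in matches) {
        var key = m.Groups["key"].ToString();
        // The value starts right after "key:" and runs up to the next
        // keyword match, or to the end of the string for the last one.
        var start = m.Index + m.Length;
        var nx = m.NextMatch();
        var end = (nx.Success ? nx.Index : source.Length);
        // Indexer instead of Add: a repeated key overwrites the earlier
        // value rather than throwing an ArgumentException.
        found[key] = source.Substring(start, end - start);
    }

    foreach (var n in found) {
        Console.WriteLine("Key={0}, Value={1}", n.Key, n.Value);
    }
}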

And the output from this is:

Key=Key1, Value=Value1
Key=Key2, Value= ValueAnd A: To Test 
Key=Key3, Value=   Something
Andrew
  • @Andrew, nice one! The "Key2: ValueAnd A:" Space in the Value was exactly what the problem was. thanks! – Flo Jan 21 '09 at 12:50
  • Glad I could help. I'm still trying to figure out a good clean regex way to do this, maybe with a simple loop, but I can only get 70% "rightness" so far. – Andrew Jan 21 '09 at 20:08
  • Looking forward to the 100% :-) – Flo Jan 22 '09 at 09:55
  • Thanks for the nice answer, it is almost what I'm looking for. However, how would I modify the solution to return 1.1, 1.2, 1.3 if the original string is "Key1:(1.1)Key2:(1.2)And A: To Test Key3:(1.3)blahblahblah"? After applying this solution to my string, I grab each value and split it on the parentheses. But isn't there a nicer pure regex solution? – AleX_ Nov 15 '16 at 20:23
0
/KeywordB\: (\w)/

This matches any word that comes after your keyword. As you didn't mention any terminator, I assumed that you wanted only the word next to the keyword.

Tiago
  • That doesn't appear to be a C# regular expression; it looks suspiciously like Perl. – Andrew Jan 18 '09 at 01:35
  • @Andrew, do you mean because it's enclosed in slashes? That's no big deal; just replace them with quotes. There's nothing in the regex itself that would cause C# to barf. – Alan Moore Jan 18 '09 at 02:37
  • @Tiago, the real problem is that `\w` only matches one character; you should change it to `\w+`. Also, I believe Flo only used "KeywordB" as an example, and you should replace that with `\w+` as well. – Alan Moore Jan 18 '09 at 03:05