
I am trying to parse a flavour of Markdown in which some keywords are wrapped in double quotes or angle brackets.

Words between " are static keywords, and the ones between < and > are dynamic.

Sample:

* Say "hello" to "world"
* Say <something> to <somebody>
* I can also be a plain statement

The logic goes like this:

  • Find all lines that start with a *.
  • Check whether the line contains any keywords.
  • Extract the keywords, if any.

I have a simple regex, `(\W+(\*.+))`, that extracts the line, but I am not sure how to extend it to extract the values between quotes or angle brackets.

UPDATE 1

So, after a hint from @EvanKnowles' link, I came up with this regex, which seems to work, but I'd be happy to hear of any improvements.

    [ ]*\*([\w ]*(["\<][\w ]+["\>])*)*

UPDATE 2

A few people have suggested doing this in steps, i.e. get all valid lines in a first pass and then look up the keywords in each line. I'd like to keep this as my last option: the consumer of this code needs to know each keyword and its position in the entire string, so splitting the parent string means taking on the overhead of tracking offsets.
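For illustration, the three steps listed above can be sketched in a single pass that keeps offsets into the original string. This is only a sketch under assumptions: the pattern, the class name `KeywordScanner`, and the method `Scan` are hypothetical and not from the original post, and keywords are assumed not to span lines. It relies on the fact that .NET keeps every repetition of a group in `Group.Captures`, and that `Capture.Index` is a position in the whole input.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class KeywordScanner
{
    // Hypothetical pattern: a line starting with '*', followed by any number of
    // "static" or <dynamic> keywords, then the rest of the line.
    private static readonly Regex BulletLine = new Regex(
        @"^[ \t]*\*(?:[^""<\r\n]*(?:""(?<static>[^""\r\n]*)""|<(?<dynamic>[^>\r\n]*)>))*[^\r\n]*",
        RegexOptions.Multiline);

    // Returns the offset in the whole input, the keyword text, and its kind.
    public static List<(int Offset, string Text, bool IsDynamic)> Scan(string input)
    {
        var found = new List<(int Offset, string Text, bool IsDynamic)>();
        foreach (Match line in BulletLine.Matches(input))
        {
            // Captures holds every repetition of the group, not just the last one.
            foreach (Capture c in line.Groups["static"].Captures)
                found.Add((c.Index, c.Value, false));
            foreach (Capture c in line.Groups["dynamic"].Captures)
                found.Add((c.Index, c.Value, true));
        }
        // Restore document order, since the two groups were read separately.
        return found.OrderBy(k => k.Offset).ToList();
    }
}
```

Lines that start with `*` but contain no keywords still produce a match in `BulletLine.Matches`; they simply contribute no captures, so a consumer that needs those lines too could read `line.Index` and `line.Value` in the outer loop.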

Srikanth Venugopalan
  • Have a look at http://stackoverflow.com/questions/4231382/regular-expression-pattern-not-matching-anywhere-in-string – Evan Knowles Oct 30 '14 at 09:01
  • Extract the lines via regex to strings and then split them up? `string[] result = regexString.Split('"');`. The quoted values would be at every odd zero-based index (1, 3, 5, ...). – C4d Oct 30 '14 at 09:14
  • @C4ud3x Ah but there is a catch - if I split the lines, I need to keep track of the offset (i.e. line number) for the consumer of this bit, and I am trying to avoid it. – Srikanth Venugopalan Oct 30 '14 at 09:21
  • Yeah, maybe a bad way to go. Check Coder Hawk's solution. It grabs only the keyword. The regex is not complete, I guess, but it is a very good starting point. – C4d Oct 30 '14 at 09:26

3 Answers

2

The expression below extracts the quoted keywords. Try it!

    using System.Text.RegularExpressions;

    /// <summary>
    ///  A description of the regular expression:
    ///
    ///  Beginning of line or string
    ///  [1]: A numbered capture group. [.*?\"(?<keyword>.*?)\"], one or more repetitions
    ///      .*?\"(?<keyword>.*?)\"
    ///          Any character, any number of repetitions, as few as possible
    ///          Literal "
    ///          [keyword]: A named capture group. [.*?]
    ///              Any character, any number of repetitions, as few as possible
    ///          Literal "
    /// </summary>
    public static Regex regex = new Regex(
        "^(.*?\\\"(?<keyword>.*?)\\\")+",
        RegexOptions.IgnoreCase
        | RegexOptions.Multiline
        );

    // Capture all matches in InputText (the string to scan); one match per line with quoted keywords.
    // Groups["keyword"].Value is only the last keyword of a match;
    // Groups["keyword"].Captures lists every keyword on that line.
    MatchCollection ms = regex.Matches(InputText);

Use the Expresso tool to learn and build regular expressions; it can also generate C# or VB.NET code.
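Since the `keyword` group sits inside a repeated group, a match exposes only the last keyword through `Groups["keyword"].Value`. A minimal sketch of reading every repetition through `Group.Captures` follows; the class name `QuotedKeywords` and the method `All` are hypothetical, and the pattern is the same one as above written as a verbatim string.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QuotedKeywords
{
    // Same pattern as in the answer, as a verbatim string ("" is a literal quote).
    private static readonly Regex Pattern = new Regex(
        @"^(.*?""(?<keyword>.*?)"")+",
        RegexOptions.Multiline);

    // Collects every quoted keyword in the input, in document order.
    public static List<string> All(string input)
    {
        var keywords = new List<string>();
        foreach (Match m in Pattern.Matches(input))
        {
            // One Capture per repetition of the named group within this line.
            foreach (Capture c in m.Groups["keyword"].Captures)
                keywords.Add(c.Value);
        }
        return keywords;
    }
}
```

Each `Capture` also carries an `Index` into the full input, which may help when the position of a keyword matters. Note that this pattern does not restrict matches to lines starting with `*`.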

Sandeep Kumar M
  • Big thanks for linking `Expresso`. Just downloaded it. Really nice tool to test regex. :) Upvoted. – C4d Oct 30 '14 at 09:21
  • Doesn't seem to work exactly as I expected. Shouldn't the `keyword` group be repeated? I can have multiple keywords per line. Thanks for the link to `Expresso`. I use [Regex Storm](http://regexstorm.net/tester) for .NET and [Rubular](http://www.rubular.com) for Ruby. – Srikanth Venugopalan Oct 30 '14 at 09:25
  • Yep. This regex only shows the last keyword instead of all of the keywords on the line. – C4d Oct 30 '14 at 09:27
  • Check this out: `^(.*?\"(?<keyword>.*?)\")+`. It's one step closer to the goal. I changed the beginning from `\\*.*\"` (which matches as many characters as possible up to the `"`) to `.*?\"` (which matches as few characters as possible up to the `"`). Repeating this one or more times will grab everything. I hope that was clear, as my English isn't that good. :) – C4d Oct 30 '14 at 09:50
1
^(?=\*).*$

You can do this in two steps. First, grab the lines starting with *. See the demo.

http://regex101.com/r/dP9rO4/2

Then you can grab the keywords through captures or matches.

See demo.

http://regex101.com/r/eM1xP0/2
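A rough sketch of how the two steps might be combined in C# without losing positions: the line match already knows its offset in the whole input, so a keyword's absolute position is the line offset plus the keyword's offset within the line. The keyword pattern, the class name `TwoStepScanner`, and the method `Scan` are assumptions for illustration; the pattern used in the second demo may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TwoStepScanner
{
    // Step 1: lines starting with '*' (the pattern from this answer).
    private static readonly Regex Line = new Regex(@"^(?=\*).*$", RegexOptions.Multiline);

    // Step 2: hypothetical keyword pattern, "static" or <dynamic>.
    private static readonly Regex Keyword = new Regex(@"""([^""]*)""|<([^>]*)>");

    // Returns each keyword with its offset in the whole input.
    public static List<(int Offset, string Text)> Scan(string input)
    {
        var result = new List<(int Offset, string Text)>();
        foreach (Match line in Line.Matches(input))
        {
            foreach (Match k in Keyword.Matches(line.Value))
            {
                // Group 1 is set for quoted keywords, group 2 for bracketed ones.
                Group g = k.Groups[1].Success ? k.Groups[1] : k.Groups[2];
                // line.Index is relative to the input; g.Index is relative to the line.
                result.Add((line.Index + g.Index, g.Value));
            }
        }
        return result;
    }
}
```

No string splitting is involved, so there is no separate line-number bookkeeping; the only extra work is one addition per keyword.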

vks
  • Thanks, but if I split lines, I am inviting other problems. I need to maintain a track of offset (line number) to know where the keyword was originally positioned. – Srikanth Venugopalan Oct 30 '14 at 09:28
  • @SrikanthVenugopalan You don't have to split; you can search the groups per line. – vks Oct 30 '14 at 09:29
  • Neat, now comes the tricky bit, these matches need to happen only on lines starting with a `*`. And there can also be lines without any keywords, which need to be matched too. – Srikanth Venugopalan Oct 30 '14 at 09:32
  • @SrikanthVenugopalan I have already given two regexes. The first gives all lines starting with `*`. Next, get the groups from each line by applying the second regex. – vks Oct 30 '14 at 09:51
0
public static Regex regex = new Regex(
      "^\\*.*(<|\")(\\w+)(>|\")",
    RegexOptions.IgnoreCase
    | RegexOptions.Multiline
    | RegexOptions.CultureInvariant
    | RegexOptions.IgnorePatternWhitespace
    | RegexOptions.Compiled
    );

Using this regex, the value between `"` or between `<` and `>` will appear at group index 2; you can then look at group index 1 to discover whether the match is a static or a dynamic keyword.
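As a small sketch of the group lookup described above (the class name `KeywordKind` and its methods are hypothetical, and only `RegexOptions.Multiline` is kept for brevity): group 1 holds the opening delimiter and group 2 the keyword. Because `.*` is greedy, this pattern finds the last keyword on a line, so it seems best suited to lines with a single keyword.

```csharp
using System;
using System.Text.RegularExpressions;

public static class KeywordKind
{
    // The pattern from this answer.
    private static readonly Regex Pattern = new Regex(
        "^\\*.*(<|\")(\\w+)(>|\")",
        RegexOptions.Multiline);

    // Returns "static", "dynamic", or null when no keyword is found.
    public static string Classify(string line)
    {
        Match m = Pattern.Match(line);
        if (!m.Success) return null;
        // Group 1 is the opening delimiter: a quote means a static keyword.
        return m.Groups[1].Value == "\"" ? "static" : "dynamic";
    }

    // Returns the keyword text from group 2, or null when no keyword is found.
    public static string Keyword(string line)
    {
        Match m = Pattern.Match(line);
        return m.Success ? m.Groups[2].Value : null;
    }
}
```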

As @coder-hawk said, "Use Expresso". It's a free and very useful tool for writing and testing regular expressions.

giacomelli