The text follows this pattern:

<tr class="text" (any sequence of characters here, except ABC)ABC(any sequence of characters here, except ABC)
<tr class="text" (any sequence of characters here, except ABC)ABC(any sequence of characters here, except ABC)
<tr class="text" (any sequence of characters here, except ABC)ABC(any sequence of characters here, except ABC)
<tr class="text" (any sequence of characters here, except ABC)ABC(any sequence of characters here, except ABC)

So the above row (which might include line breaks) can repeat multiple times, and the goal is to retrieve the first 3 characters immediately after ABC.

I have tried regular expressions along the lines of:

 \<tr class="text" [.\n]+ABC(?<capture>[.]{3})

but they all fail. Can someone give me a hint?

John Smith
  • so do you want to retrieve `)AB` or `(AN`? – Sam I am says Reinstate Monica Nov 21 '12 at 23:00
  • `(AN` since the `ABC` inside the parentheses isn't actually there. – Martin Ender Nov 21 '12 at 23:01
  • There are better string functions for this. Find the location of "ABC" and take a substring of length 3 from there. This is a lot faster when you deal with a lot of data. – Shiplu Mokaddim Nov 21 '12 at 23:02
  • shiplu, that's a good point, but I'm also capturing a bunch of other stuff. the regex fails the moment I try to search for ABC within the sequence. – John Smith Nov 21 '12 at 23:17
  • If you're really asking about "a bunch of other stuff", then please include it in your question. You do not need regexes to answer the question you posted, we cannot know if they will work for your real question if you don't post your real question. – Dour High Arch Nov 21 '12 at 23:48
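
The string-function approach suggested in the comments above could look roughly like the following. This is only a sketch: the helper name `TakeAfter` and the sample row are made up for illustration, and it assumes that only the first "ABC" matters.

```csharp
using System;

public static class AfterMarker
{
    // Returns the `count` characters that follow the first occurrence of
    // `marker`, or null if the marker is missing or too close to the end.
    // (Hypothetical helper, not part of the original thread.)
    public static string TakeAfter(string text, string marker, int count)
    {
        int index = text.IndexOf(marker, StringComparison.Ordinal);
        if (index < 0)
        {
            return null;
        }

        int start = index + marker.Length;
        if (start + count > text.Length)
        {
            return null;
        }

        return text.Substring(start, count);
    }

    public static void Main()
    {
        // Made-up sample row; the line break inside it is intentional.
        string row = "<tr class=\"text\" id=\"r1\">\n<td>ABC123</td></tr>";
        Console.WriteLine(TakeAfter(row, "ABC", 3)); // prints "123"
    }
}
```

Line breaks need no special handling here, because `IndexOf` treats them like any other character.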

3 Answers

1

Inside a character class, `.` loses its wildcard meaning and becomes a literal period, so `[.\n]` matches only a period or a line break. Just use

\<tr class="text" .+?ABC(?<capture>.{3})

Make sure you use RegexOptions.Singleline, so that . matches linebreaks, too!
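
As a rough illustration of the effect of `RegexOptions.Singleline`, the sketch below runs the pattern against a made-up row that contains a line break. The sample HTML and the helper name `FirstCapture` are assumptions for the example; the leading backslash before `<` is dropped because `<` does not need escaping.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    // Pattern from the answer; `capture` is the named group.
    private const string Pattern = "<tr class=\"text\" .+?ABC(?<capture>.{3})";

    // Returns the 3 characters after the first ABC, or null if nothing matches.
    // (Hypothetical helper, not part of the original answer.)
    public static string FirstCapture(string html, RegexOptions options)
    {
        Match match = Regex.Match(html, Pattern, options);
        return match.Success ? match.Groups["capture"].Value : null;
    }

    public static void Main()
    {
        // Made-up sample row with a line break before the ABC.
        string html = "<tr class=\"text\" id=\"r1\">\n<td>ABC123</td></tr>";

        // Without Singleline, . stops at the line break, so there is no match.
        Console.WriteLine(FirstCapture(html, RegexOptions.None) ?? "(no match)");

        // With Singleline, . also matches the line break.
        Console.WriteLine(FirstCapture(html, RegexOptions.Singleline) ?? "(no match)");
    }
}
```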

However, you shouldn't actually use regular expressions for this at all. Instead, use a DOM parser. I have seen the HTML Agility Pack recommended quite regularly for .NET.

Martin Ender
  • Your regex won't do the trick. Also, I already considered the HTML Agility Pack but decided to use a regex instead since I don't want to introduce third party code into this application. – John Smith Nov 21 '12 at 23:05
  • would you care to show the code that uses the regular expression? – Martin Ender Nov 21 '12 at 23:07
  • well, here's the thing, .+ABC is wrong because the . does not account for the \n character, which might be somewhere in the sequence. [.\n]+ won't do it either – John Smith Nov 21 '12 at 23:09
  • @Patrick if there is only one `ABC` as the OP states, that won't make a difference – Martin Ender Nov 21 '12 at 23:09
  • @JohnSmith now that is important. your question states that those are "lines" which means, they don't contain linebreaks. I'll edit the answer – Martin Ender Nov 21 '12 at 23:10
  • @m.buettner except that the .+ will match the ABC before it reaches that part of the pattern – Patrick Nov 21 '12 at 23:11
  • @Patrick what do you mean? if it was all on a single line and there was only one `ABC` then `.+` and `.+?` would both give the same result, since the engine backtracks in both cases until it finds the `ABC` – Martin Ender Nov 21 '12 at 23:12
  • @m.buettner interesting. I was sure it didn't do that. Must be a different engine, because I know I've had to use that before in similar situations. – Patrick Nov 21 '12 at 23:18
  • Dear downvoter, would you care for a comment? – Martin Ender Nov 21 '12 at 23:25
  • @Patrick, most engines should do. Unless you use the quantifier inside an atomic group (that includes lookarounds), or you make it explicitly possessive (`.++`). In these cases you are right. – Martin Ender Nov 21 '12 at 23:30
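
The backtracking behaviour discussed in these comments can be checked with a small sketch. It assumes an input with exactly one "ABC" (the sample string is borrowed from another answer in this thread), and it uses an atomic group, which .NET supports, to show the case where backtracking is disabled. The helper name `Capture` is made up.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BacktrackingDemo
{
    // Returns group 1 of the first match, or null if the pattern fails.
    // (Hypothetical helper for this illustration.)
    public static string Capture(string pattern, string input)
    {
        Match match = Regex.Match(input, pattern);
        return match.Success ? match.Groups[1].Value : null;
    }

    public static void Main()
    {
        string input = "baln390nABCqlcln"; // contains exactly one ABC

        // Greedy and lazy quantifiers agree when there is only one ABC.
        Console.WriteLine(Capture(".+ABC(.{3})", input));  // prints "qlc"
        Console.WriteLine(Capture(".+?ABC(.{3})", input)); // prints "qlc"

        // Atomic group: .+ never gives back what it consumed, so ABC cannot match.
        Console.WriteLine(Capture("(?>.+)ABC(.{3})", input) ?? "(no match)");
    }
}
```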
0

Here is a regex that will capture the first 3 characters after an "ABC" in your string:

".+ABC(...)"

In C#, your match will have a collection of groups, and one of those groups will contain the 3 characters.

Just make sure that you don't have any unexpected "ABC"s in your string: `.+` is greedy, so the group captures what follows the last "ABC" on the line rather than the first.

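This code

using System;
using System.Text.RegularExpressions;

public static class Program
{
    public static void Main()
    {
        // Greedy .+ runs up to the last "ABC"; group 1 holds the next 3 characters.
        Regex regex = new Regex(".+ABC(...)");

        Match match = regex.Match("baln390nABCqlcln");
        // Groups[0] is the whole match; Groups[1] is the captured text.
        foreach (Group group in match.Groups)
        {
            Console.WriteLine(group.Value);
        }
    }
}

gives this output: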

baln390nABCqlc
qlc
Press any key to continue . . .
0
<tr class="text" .+ABC(?<capture>.{3})

Use this in conjunction with RegexOptions.Singleline (so that . matches line breaks).
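
Since the question says the row can repeat, here is a hedged sketch of how the greedy pattern above and a lazy variant (`.+?`) behave with `Regex.Matches` on two made-up rows. The sample HTML and the helper name `Captures` are assumptions for the example, not something taken from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AllRowsDemo
{
    // Collects the `capture` group of every match in the input.
    // (Hypothetical helper for this illustration.)
    public static List<string> Captures(string pattern, string html)
    {
        var results = new List<string>();
        foreach (Match match in Regex.Matches(html, pattern, RegexOptions.Singleline))
        {
            results.Add(match.Groups["capture"].Value);
        }
        return results;
    }

    public static void Main()
    {
        // Two made-up rows, each with a line break and one ABC.
        string html =
            "<tr class=\"text\" id=\"r1\">\n<td>ABC111</td></tr>\n" +
            "<tr class=\"text\" id=\"r2\">\n<td>ABC222</td></tr>\n";

        string greedy = "<tr class=\"text\" .+ABC(?<capture>.{3})";
        string lazy = "<tr class=\"text\" .+?ABC(?<capture>.{3})";

        // Greedy: one match spanning both rows, captured after the last ABC.
        Console.WriteLine(string.Join(",", Captures(greedy, html))); // prints "222"

        // Lazy: one match per row.
        Console.WriteLine(string.Join(",", Captures(lazy, html)));   // prints "111,222"
    }
}
```

With a single row the two patterns give the same capture; the difference only shows up once the input holds several rows.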

Martin Ender
Rashack
  • Well - maybe we both started typing at the same time and you were faster... – Rashack Nov 27 '12 at 13:18
  • Fair enough. Unfortunately, your answer still doesn't solve the whole problem (which is line breaks in the string). – Martin Ender Nov 27 '12 at 13:58
  • I wasn't saying it's not possible (see my answer). And I think even closed question should rather have correct answers than abandoned ones ;) (the question has not been deleted after all) – Martin Ender Nov 28 '12 at 19:59