1

I am essentially trying to extract the dollar amount using regex, but can't figure out how to extract just the dollar amount, which can vary in digits. Below is an example of the amount field I want to extract which is always in the middle of the field:

&ltfield1&gt05/14/2013&lt/field1&gt&ltamount&gt3,100,000.00&lt/amount&gt&ltfield3&gt026002561&lt/field3&gt

what i currently have: <amount>.*</amount> (this result doesnt get me what I want)

for this field, I want to extract just the 3.1 million figure. The structure (similar to html) surrounding the dollar figure will always be the same. Any help is appreciated.

SeanC
  • 15,695
  • 5
  • 45
  • 66
Chris
  • 13
  • 2
  • 3,100,000.00 field3 is separate and contains different information, i included that to illustrate that the dollar figure is in the middle of the string. Thanks – Chris Jun 03 '13 at 17:11
  • What error are you getting, why doesn't `.*` work? According to standard Regex rules, that *will* work, so is it possible you're doing something else wrong, like perhaps not setting up a capture group so that you can extract the 3.1m value? – qJake Jun 03 '13 at 17:15
  • Show the code you're using to apply the regex – Tim Williams Jun 03 '13 at 17:35
  • The regex I tried works, but I want the value parsed out of ..... – Chris Jun 03 '13 at 17:52
  • After looking up a capture group, it appears that could work, any thoughts on how to approach it in this situation? – Chris Jun 03 '13 at 18:28

2 Answers2

0

Excel

Since you're doing this in excel, you might want to consider using this formula

=MID(B1,SEARCH("<amount>",B1)+8,SEARCH("</amount>",B1)-(SEARCH("<amount>",B1) + 8))

  • B1 = input string
  • the +8 compensates for the width of the string <amount>
  • column C reveals the formula used

enter image description here

Regex

If you're doing this with VBA and a regex you could use the regex: <(amount)\b[^>]*>([^<]*)<\/\1>

enter image description here

This VB.net example is only included to show how the regex populates the group 3 with each of the dollar values found in amount tags.

Imports System.Text.RegularExpressions
Module Module1
  Sub Main()
    Dim sourcestring as String = "<field1>05/14/2013</field1><amount>3,100,000.00</amount><field3>026002561</field3>
    <field1>05/14/2013</field1><amount>4,444,444.00</amount><field3>026002561</field3>"
    Dim re As Regex = New Regex("<(amount)\b[^>]*>([^<]*)<\/\1>",RegexOptions.IgnoreCase OR RegexOptions.Multiline OR RegexOptions.Singleline)
    Dim mc as MatchCollection = re.Matches(sourcestring)
    Dim mIdx as Integer = 0
    For each m as Match in mc
      For groupIdx As Integer = 0 To m.Groups.Count - 1
        Console.WriteLine("[{0}][{1}] = {2}", mIdx, re.GetGroupNames(groupIdx), m.Groups(groupIdx).Value)
      Next
      mIdx=mIdx+1
    Next
  End Sub
End Module

$matches Array:
(
    [0] => Array
        (
            [0] => <amount>3,100,000.00</amount>
            [1] => <amount>4,444,444.00</amount>
        )

    [1] => Array
        (
            [0] => amount
            [1] => amount
        )

    [2] => Array
        (
            [0] => 3,100,000.00
            [1] => 4,444,444.00
        )

)
Ro Yo Mi
  • 14,790
  • 5
  • 35
  • 43
  • thanks for the help, this is interesting. I will incorporate and see what I get – Chris Jun 03 '13 at 20:08
  • Did you mean to dim the variable m? Im trying to get your code to work – Chris Jun 03 '13 at 20:26
  • This code is actually vb.net, which will be slightly different the vba – Ro Yo Mi Jun 03 '13 at 20:28
  • gotcha, thought it looked a little funny – Chris Jun 03 '13 at 20:36
  • @Denomales Not sure about the `\b[^>]*` part in the regex. What purpose does it serve? Wouldn't it match `3,100,000.00<\amount>` which would be incorrect? The correct and least expensive regex would be `([^<]*)<\/amount>` – Kash Jun 03 '13 at 20:58
  • @Denomales: Also not sure why you are using backreference here just for `amount` leading to another expensive capturing group. Backreferences are more useful when you are trying to reuse another complex regex unlike this case where it is a literal `amount`. – Kash Jun 03 '13 at 21:08
  • @Denomales I suspect that the `\b[^>]` was meant to match any attributes for the opening xml tag. But this is not the requirement of the OP and hence would become a defect. – Kash Jun 03 '13 at 21:10
  • The `\b` captures the word break after the tag name, the `\[^>]*` captures the next zero to infinite non `>` characters upto the next `>` character. This is not defect in your example `3,100,000.00<\amount>` the `[^>]*` portion would capture the ` blah` including the space before blah. – Ro Yo Mi Jun 03 '13 at 22:42
  • 1
    @Kash, what do you mean when you refer to a capturing group as expensive? – Chris Jun 04 '13 at 01:26
  • @Denomales a) Your regex introduces patterns that the OP does not need and creates unnecessary side effects. The question was to match `` and not anything else. Introducing a pattern like `\b[^>]*` will also match malformed attributes. – Kash Jun 04 '13 at 05:32
  • @Chris Your regex performance deteriorates with increasing number of capturing groups, besides other (many other) factors. That's why it is preferred to use non-capturing groups in cases when you need to just use quantifiers on a group. You do not need either in this case as you know exactly what you need to match. My point is you don't need a subpattern to determine what you need to match when you already know what the subpattern would yield - in this case amount. – Kash Jun 04 '13 at 18:49
  • @Kash, To-may-to To-ma-to. In many cases people come here to ask specific questions with watered down sample text. OP didn't specify they wanted the most efficient solution, nor do they specify the amount of text being parsed. At what point does the delta between this expression's performance vs your expression's performance become even noticeable? – Ro Yo Mi Jun 04 '13 at 19:43
  • @Denomales Efficiency with regexes is crucial coz it decides if you need the regex at all. Reg the delta, exactly my point - you would never know when the delta becomes noticeable esp with regexes coz you do not have control on the size of the input text. And it ain't just a matter of efficiency in this case, because you would have it match crazier text like `>3,100,000.00` extracting `>3,100,000.00` which is erroneous. – Kash Jun 04 '13 at 20:30
  • hahaha. It's funny because your expression `([^<]*)<\/amount>` also captures `>3,100,000.00` which by your own standard is erroneous. Which is more likely: requester will encounter illegal x/html like `>3,100,000.00` or text with attributes like `3,100,000.00`. Keep in mind your expression will also fail if there are any attributes. – Ro Yo Mi Jun 05 '13 at 02:53
-1

Use Excel VBA to extract your capturing group.

VBA code

Function TestRegExp(ByVal myString As String, _
                      ByVal myPattern As String, _
                      Optional seperator As String = "") As String
   Dim objRegExp As RegExp
   Dim colMatches As MatchCollection
   Dim RetStr As String

   Set objRegExp = New RegExp
   objRegExp.Pattern = myPattern
   objRegExp.IgnoreCase = True
   objRegExp.Global = True
   seperator = "|"

   If (objRegExp.Test(myString) = True) Then
    Set colMatches = objRegExp.Execute(myString)
    For i = 0 To colMatches.Count - 1
        For j = 0 To colMatches.Item(i).SubMatches.Count - 1
            If (RetStr <> "") Then
                RetStr = RetStr & seperator & colMatches.Item(i).SubMatches.Item(j)
            Else
                RetStr = colMatches.Item(i).SubMatches.Item(j)
            End If
        Next
    Next
   Else
    RetStr = "No Match"
   End If
   TestRegExp = RetStr
End Function

Excel
And the function in Excel to test this would be:

=TestRegExp(B2,"<amount>([^<]*)<\/amount>")

where cell B2 has your text:

<field1>05/14/2013</field1><amount>3,100,000.00</amount><field3>026002561</field3>
Output: 3,100,000

OR

<field1>05/14/2013</field1><amount>3,100,000.00</amount><field3>026002561</field3><amount>999</amount>  
Output: 3,100,000|999

Please note:

  1. Use .*? instead of .*. This helps tackle multiple amount tags issue as it parses lazily due to the question mark. You can choose your separator in the code.
  2. The trick is to use submatches to get capturing groups
Kash
  • 8,799
  • 4
  • 29
  • 48