If I have the following pattern in some text:

def articleContent =  "<![CDATA[ Hellow World ]]>"

I would like to extract the "Hellow World" part, so I use the following code to match it:

def contentRegex = "<![CDATA[ /(.)*/ ]]>"
def contentMatcher = ( articleContent =~ contentRegex )
println contentMatcher[0]

However, I keep getting a null pointer exception because the regex doesn't seem to be working. What would be the correct regex for "any piece of text", and how do I collect it from a string?

ekad
RicardoE

5 Answers

Try:

def result = (articleContent =~ /<!\[CDATA\[(.+)]]>/)[0][1]

However, I worry that you are planning to parse XML with regular expressions. If this CDATA section is part of a larger valid XML document, it is better to use an XML parser.
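
If the surrounding spaces are unwanted, or a missing match should not throw, a variant along the following lines may help. This is only a sketch: the helper name `extractCdata`, the `(?s)` flag, and the `\s*(.+?)\s*` refinement are illustrative assumptions, not part of the one-liner above.

```groovy
// Hypothetical helper: returns the CDATA body without surrounding whitespace,
// or null when the text contains no CDATA section.
String extractCdata(String text) {
    // (?s) lets '.' span line breaks; (.+?) is reluctant, so it stops at the first ]]>
    def matcher = text =~ /(?s)<!\[CDATA\[\s*(.+?)\s*]]>/
    // find() moves to the first match and reports whether there is one
    return matcher.find() ? matcher.group(1) : null
}

println extractCdata("<![CDATA[ Hellow World ]]>")   // Hellow World
println extractCdata("no cdata here")                // null
```

Returning null instead of throwing is a design choice for this sketch; throwing may be preferable when a missing CDATA section indicates bad input.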

tim_yates
  • It's not for valid XML... that's the problem. Thank you very much! – RicardoE Jul 08 '13 at 22:36
  • I'm new to Groovy. Can you please explain why we need to dereference the matcher with `[0]` in order to get a list of groups? – Gili Dec 02 '16 at 06:05
  • @Gili Because there can be multiple matches. In this case, `"<![CDATA[ Hellow World ]]> <![CDATA[ Hi Everyone ]]>"`, you could extract `Hi Everyone` with `[1][1]` (provided the group is non-greedy, e.g. `(.+?)`; a greedy `(.+)` would span both sections as a single match). – Federico Nafria Oct 13 '20 at 15:32
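
To make the indexing discussed in these comments concrete, here is a small sketch. The variable names are illustrative; the point is that `matcher[i]` selects the i-th match and, when the pattern has groups, yields a list whose element 0 is the whole match and whose element 1 is the first group.

```groovy
def text = "<![CDATA[ Hellow World ]]> <![CDATA[ Hi Everyone ]]>"

// Reluctant group (.+?): each CDATA section is a separate match
def lazy = text =~ /<!\[CDATA\[(.+?)]]>/
// lazy[i] is the i-th match as a list: [whole match, group 1]
def firstBody = lazy[0][1].trim()
def secondBody = lazy[1][1].trim()
println firstBody    // Hellow World
println secondBody   // Hi Everyone

// Greedy group (.+): the capture runs to the last ]]>, so there is only one match
def greedy = text =~ /<!\[CDATA\[(.+)]]>/
println greedy.size()   // 1
```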

The code below shows substring extraction using a regex in Groovy:

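class StringHelper {
    // @NonCPS is only relevant inside a Jenkins Pipeline, where it resolves without an import;
    // drop the annotation when running this as plain Groovy.
    @NonCPS
    static String stripSshPrefix(String gitUrl) {
        // Capture everything after the ssh:// scheme
        def match = (gitUrl =~ /ssh:\/\/(.+)/)
        if (match.find()) {
            return match.group(1)
        }
        // No ssh:// prefix: return the URL unchanged
        return gitUrl
    }

    static void main(String... args) {
        def gitUrl = "ssh://git@github.com:jiahut/boot.git"
        def gitUrl2 = "git@github.com:jiahut/boot.git"
        println(stripSshPrefix(gitUrl))   // git@github.com:jiahut/boot.git
        println(stripSshPrefix(gitUrl2))  // git@github.com:jiahut/boot.git
    }
}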
slim
jiahut

A little late to the party, but try defining your pattern as a slashy string (/.../), so that backslashes such as the one in \w don't need to be doubled. For example:

def articleContent = "real groovy"
def matches = (articleContent =~ /gr\w{4}/) // matches 'gr' and the 4 word characters that follow
def firstmatch = matches[0]  // firstmatch is 'groovy'

You were on the right track; it was just the pattern definition that needed to change.
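
As a quick illustration of why the pattern definition matters, the sketch below writes the same regex three ways. The variable names are made up for the example.

```groovy
// The same pattern written three ways
def slashy  = /gr\w{4}/      // slashy string: the backslash is passed through as-is
def quoted  = 'gr\\w{4}'     // quoted string: the backslash must be doubled
def pattern = ~/gr\w{4}/     // the ~ operator compiles the string into a java.util.regex.Pattern

assert slashy == quoted
assert pattern.pattern() == slashy

// =~ accepts either a string or a precompiled Pattern on its right-hand side
def matches = "real groovy" =~ pattern
println matches[0]   // groovy
```

With no capture groups in the pattern, `matches[0]` is the matched string itself rather than a list of groups.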

References:

https://www.regular-expressions.info/groovy.html

http://mrhaki.blogspot.com/2009/09/groovy-goodness-matchers-for-regular.html

Michael Y

One more single-line solution, in addition to tim_yates's:

def result = articleContent.replaceAll(/<!\[CDATA\[(.+)]]>/,/$1/)

Note that if the regexp doesn't match, result will be equal to the source string, unlike with

def result = (articleContent =~ /<!\[CDATA\[(.+)]]>/)[0][1]

which will raise an exception.
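
The difference can be seen side by side in the sketch below. The variable names are illustrative, and the guard relies on Groovy truth for matchers (a matcher that finds nothing evaluates to false), which is one possible way to avoid the exception, not the only one.

```groovy
def cdata = /<!\[CDATA\[(.+)]]>/
def plain = "no CDATA here"

// replaceAll: a non-matching input comes back unchanged
def replaced = plain.replaceAll(cdata, '$1')
println replaced   // no CDATA here

// Indexing: guard with Groovy truth before dereferencing [0][1]
def miss = plain =~ cdata
def missed = miss ? miss[0][1] : null
println missed     // null

// The same guard on a matching input returns the captured group
def hit = "<![CDATA[ Hellow World ]]>" =~ cdata
def found = hit ? hit[0][1] : null
```

Note that `found` keeps the spaces inside the CDATA section, since `(.+)` captures them; call `trim()` if they are unwanted.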

Naeel Maqsudov

In my case, the actual string was multi-line, like below:

ID : AB-223
Product : Standard Profile
Start Date : 2020-11-19 00:00:00
Subscription : Annual
Volume : 11
Page URL : null
Commitment : 1200.00
Start Date : 2020-11-25 00:00:00

I wanted to extract the Start Date value from this string, so here is what my script looks like:

def matches = (originalData =~ /(?<=Actual Start Date :).*/)
def extractedData = matches[0]

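This regex uses a lookbehind, so each match is the remainder of a line that follows the given prefix. The prefix in the pattern must match the data exactly: the pattern above looks for Actual Start Date :, so against the sample shown here, where the lines read Start Date :, the lookbehind would need to be (?<=Start Date :). matches[0] is then the first matching line and matches[1] the second, each including the leading space after the colon.

In my case, the result is 2020-11-25 00:00:00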

Note: if your originalData is a multi-line string, in Groovy you can define it as follows:

def originalData = 
"""
ID : AB-223
Product : Standard Profile
Start Date : 2020-11-19 00:00:00
Subscription : Annual
Volume : 11
Page URL : null
Commitment : 1200.00
Start Date : 2020-11-25 00:00:00
"""

This script looks simple, but it took me a good while to figure out a few things, so I'm posting it here.
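
For reference, here is a self-contained sketch of the lookbehind approach. It assumes the prefix is Start Date : exactly as in the sample data (shortened here) and uses `findAll` to collect every occurrence; the variable names are illustrative.

```groovy
def originalData = """
ID : AB-223
Start Date : 2020-11-19 00:00:00
Volume : 11
Start Date : 2020-11-25 00:00:00
"""

// (?<=...) is a lookbehind: the prefix must precede the match but is not part of it.
// '.' does not cross line breaks by default, so .* stops at the end of each line.
def startDates = originalData.findAll(/(?<=Start Date :).*/)*.trim()

println startDates.first()   // 2020-11-19 00:00:00
println startDates.last()    // 2020-11-25 00:00:00
```

The `trim()` call removes the space that follows the colon, which would otherwise be part of each match.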

Akhil