0

I'm trying to write a regular expression to extract the page title from the HTML behind a URL, but the problem is that "." doesn't match newlines by default. How do I write a regular expression that captures the title with (.*?) when a newline could appear anywhere inside it?

I'm using Grails.

toy
  • 4
    Hmmm, any chance you are trying to parse HTML with Regex? Hope [you're not](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454), or *The `<center>` cannot hold it is too late*.
    – Darin Dimitrov Jul 11 '11 at 21:49
  • What does Grails have to do with this? Maybe you meant Groovy? – Gregg Jul 11 '11 at 21:58

3 Answers

4

Whilst you can't use a regex to parse general HTML, you can probably get away with it in this case. In Groovy, you can put the inline flag (?s) at the start of the pattern to make the dot match newlines. You should probably also use the (?i) flag to make your regex case-insensitive. You can combine these as (?is).

For example:

def titleTagWithNoLineBreaks = "<title>This is a title</title>"
def titleTagWithLineBreaks = """<title>This is
a title</title>"""

// Note the (?is) at the beginning of the regex
// The 'i' makes the regex case-insensitive
// The 's' makes the dot match newline characters
def pattern = ~/(?is)<title>(.*?)<\/title>/

def matcherWithNoLineBreaks = titleTagWithNoLineBreaks =~ pattern
def matcherWithLineBreaks = titleTagWithLineBreaks =~ pattern

assert matcherWithNoLineBreaks.size() == 1
assert matcherWithLineBreaks.size() == 1

assert matcherWithLineBreaks[0][1].replaceAll(/\n/,' ') == "This is a title"
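
Building on the pattern above, here is a minimal sketch for the case where you just want one title string back. `extractTitle` is a hypothetical helper name, not something from Groovy or Grails. It also assumes you want to tolerate attributes on the opening tag and collapse any run of whitespace (including `\r\n`) into a single space.

```groovy
// Hypothetical helper: returns the text of the first <title> element with
// whitespace runs collapsed to one space, or null when no title is found.
String extractTitle(String html) {
    // (?is): case-insensitive, and the dot also matches line terminators
    def matcher = html =~ /(?is)<title\b[^>]*>(.*?)<\/title>/
    if (!matcher.find()) {
        return null
    }
    matcher.group(1).replaceAll(/\s+/, ' ').trim()
}

assert extractTitle("<TITLE>This is\r\n   a title</TITLE>") == "This is a title"
assert extractTitle("<p>no title here</p>") == null
```

Returning null for a missing title keeps the caller from indexing into an empty matcher, which would otherwise throw.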

Hope that helps.

Adam C
1

Assuming it's for PHP:

preg_match( "#<title>(.*?)</title>#s", $source, $match );
$title = $match[1];

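In PHP, as in other PCRE-style engines, adding the s modifier after the closing delimiter changes . (any character) so that it also matches newlines. Engines without trailing modifiers, such as Java's (which Groovy uses), take the inline (?s) flag instead.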

Nahydrin
  • Where should I add the "s"? At the end of my pattern? It doesn't work in Groovy. – toy Jul 11 '11 at 22:00
  • Use delimiters to hold your regex, and put the `s` after the closing delimiter. – Nahydrin Jul 11 '11 at 22:01
  • 2
    @Dark Slipstream, why are you pushing the poor soul towards oblivion by encouraging it to use regular expressions to parse HTML? – Darin Dimitrov Jul 11 '11 at 22:12
  • I tried using XMLParser already, but some of the websites aren't well-formed. – toy Jul 11 '11 at 22:13
  • @toy, why are you using a site which is not well formed? Find an alternative. Personally, if I see that a site doesn't respect web standards, I wouldn't bet my bucks on it, even less do any development based on it. – Darin Dimitrov Jul 11 '11 at 22:15
  • @Darin, He asked how to do it using REGEX and I answered as such. If he wants to use an XML Parser, then he can do so on his own accord, that's not my decision to make. – Nahydrin Jul 11 '11 at 22:33
  • @Dark Slipstream, of course it is your decision. StackOverflow is a well referenced site on Google. Answers posted here are considered to be good. Imagine people in the future having the same problem, googling for it and coming across your answer. What will they see? Instead of seeing DO NOT, FOR CHRIST SAKE, PARSE HTML WITH REGEX, they will see your answer, which illustrates a regex. And what would some of them do? They will try to parse their HTML with a regex. Exactly as you showed. And what will the result be? I think @bobince already covered it in his answer. – Darin Dimitrov Jul 11 '11 at 22:36
  • That's your prerogative. He said he tried and failed. I answer as the OP wants, not as I want. – Nahydrin Jul 11 '11 at 22:37
  • @Dark Slipstream, answering what the OP wants is not always the best. When answering a question you should consider that there might be better solutions, for example using a full-blown HTML parser, and recommend those alternatives. Just because the OP asked how to parse HTML with regex doesn't mean you should show a regex (especially if you don't approve of it either). Answering on SO should be about what **YOU** consider the best practice in this case, and expressing that opinion. At least that's what I do when I answer questions here. And that's what I would recommend you do. – Darin Dimitrov Jul 11 '11 at 22:40
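
On the Groovy question raised in the comments above: Groovy uses Java's `java.util.regex`, which has no trailing-modifier syntax, so the PHP-style `s` has to be written either as an inline `(?s)` or as the `Pattern.DOTALL` constant. A small sketch of both, using a made-up sample string:

```groovy
import java.util.regex.Pattern

// Option 1: inline flag at the start of the pattern
def inlineFlag = ~/(?s)<title>(.*?)<\/title>/

// Option 2: flag constant passed to Pattern.compile
def compiled = Pattern.compile('<title>(.*?)</title>', Pattern.DOTALL)

// Sample input for illustration only (not from the question)
def source = "<title>Line one\nLine two</title>"

def m1 = inlineFlag.matcher(source)
def m2 = compiled.matcher(source)

assert m1.find() && m1.group(1) == "Line one\nLine two"
assert m2.find() && m2.group(1) == "Line one\nLine two"

// Without DOTALL the dot cannot cross the newline, so there is no match
assert !(source =~ /<title>(.*?)<\/title>/).find()
```

Flags can be combined with a bitwise OR, for example `Pattern.DOTALL | Pattern.CASE_INSENSITIVE`, which corresponds to the inline `(?is)`.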
0

If all you need is to parse possibly malformed HTML documents, you could try using the TagSoup parser. Then you can just use GPath expressions and won't have to worry about oddities such as "</title>" appearing in a comment inside the title.

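// Assumption: TagSoup is fetched through Grape; use whichever version you depend on
@Grab('org.ccil.cowan.tagsoup:tagsoup:1.2.1')
import org.ccil.cowan.tagsoup.Parser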

final parser  = new Parser()
final slurper = new XmlSlurper(parser)
final html    = slurper.parse('http://www.example.com/')

println html.depthFirst().find { it.name() == 'title' }
Justin Piper