
I have a Java function that extracts a string from the HTML page source of any website. The function accepts the site name along with a term to search for. The search term is always contained within <script> tags. What I need to do is pull out the entire JavaScript (everything within the tags) that contains the search term.

Here is an example -

<script type="text/javascript">
    //Roundtrip
    rtTop = Number(new Date());

    document.documentElement.className += ' jsenabled';
</script>

For the JavaScript snippet above, my search term would be "rtTop". Once it is found, I want my function to return a string containing everything within the script tags.

Any novel solution? Thanks.

rs79
  • java.equals(javascript) == false is true – jmj Sep 30 '10 at 18:34
  • @org.life.java: the OP isn't equating java and javascript. He's writing a java function that will pull javascript code out of an HTML string. Basically it's an HTML parser that only needs to do one thing. The fact that the string it's looking for is javascript isn't really relevant to the question. – Jacob Mattison Sep 30 '10 at 18:43
  • @JacobM oh, my mistake, but still the above comment is true :-) – jmj Sep 30 '10 at 18:44
  • It is guaranteed that the search term is enclosed within the javascript tags. I just need to extract ALL the content, within the JS tags. – rs79 Sep 30 '10 at 19:20

2 Answers


You could use a regular expression along the lines of

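import java.util.regex.Matcher;
import java.util.regex.Pattern;

String someHTML = ""; // placeholder: get your HTML from wherever
Pattern pattern = Pattern.compile("<script type=\"text/javascript\">(.*?rtTop.*?)</script>", Pattern.DOTALL);
Matcher myMatcher = pattern.matcher(someHTML);
// group(1) is only available after a successful find(), so check the result first
String result = myMatcher.find() ? myMatcher.group(1) : null;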
Jacob Mattison
  • So if I have a String variable storing the entire HTML page source, how would I employ the regular expression? – rs79 Sep 30 '10 at 19:28
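
To address the comment above, here is a minimal sketch of wrapping the pattern in a method that takes the page source as a String. The class and method names (ScriptExtractor, extractScript) are made up for illustration, and the search term is passed through Pattern.quote on the assumption that it should be matched literally rather than as a regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative and not part of the answer.
class ScriptExtractor {

    // Returns the text captured between a <script type="text/javascript"> tag
    // and the first </script> that follows searchTerm, or null if nothing matches.
    static String extractScript(String html, String searchTerm) {
        Pattern pattern = Pattern.compile(
                "<script type=\"text/javascript\">(.*?" + Pattern.quote(searchTerm) + ".*?)</script>",
                Pattern.DOTALL);
        Matcher matcher = pattern.matcher(html);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<html><head><script type=\"text/javascript\">\n"
                + "    rtTop = Number(new Date());\n"
                + "</script></head><body></body></html>";
        System.out.println(extractScript(html, "rtTop"));
    }
}
```

One caveat with this sketch: the lazy .*? may start at an earlier script tag when the page has several script elements and the term only appears in a later one, so it is best suited to pages where the first matching tag is the one that holds the term.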

I wish I could just comment on JacobM's answer, but I think I need more stackCred.

You could use an HTML parser; that's usually the better solution. That said, for limited scopes, I often use regex. It's a mean beast, though. One change I would make to JacobM's pattern is to replace the attributes within the opening element with [^>]*

That will allow you to match even if the "type" isn't present or if it has some other oddities. I'd also wrap the .*? with parens to make using the values later a little easier.
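
A minimal sketch of that change, assuming the goal is simply to match a script element whatever its attributes are. It uses [^>]* so that a bare <script> tag also matches, and the parenthesised .*? becomes group 1. The class and method names (ScriptTagDemo, scriptBodies) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class ScriptTagDemo {

    // [^>]* accepts any attributes (or none) in the opening tag;
    // the parenthesised .*? captures the script body as group 1.
    private static final Pattern SCRIPT =
            Pattern.compile("<script[^>]*>(.*?)</script>", Pattern.DOTALL);

    // Collects the body of every script element found in html.
    static List<String> scriptBodies(String html) {
        List<String> bodies = new ArrayList<>();
        Matcher matcher = SCRIPT.matcher(html);
        while (matcher.find()) {
            bodies.add(matcher.group(1));
        }
        return bodies;
    }

    public static void main(String[] args) {
        String html = "<script>var a = 1;</script>"
                + "<script type=\"text/javascript\">rtTop = 2;</script>"
                + "<script language=\"JavaScript\" defer>var c = 3;</script>";
        for (String body : scriptBodies(html)) {
            System.out.println(body);
        }
    }
}
```

Once the bodies are collected this way, picking the one that holds the search term can be done with a plain String.contains check instead of putting the term into the regex.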

* UPDATE * Borrowing from JacobM's answer. I'd change the pattern a little bit to handle multiple elements.

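import java.util.regex.Matcher;
import java.util.regex.Pattern;

String someHTML = ""; // placeholder: get your HTML from wherever
String lKeyword = "rtTop";
// (?!</). consumes one character at a time but refuses to step over a closing tag,
// so a match cannot spill from one script element into the next.
// Pattern.quote keeps any regex metacharacters in the keyword literal.
String lRegexPattern = "(.*)(<script[^>]*>(((?!</).)*)" + Pattern.quote(lKeyword) + "(((?!</).)*)</script>)(.*)";
Pattern pattern = Pattern.compile(lRegexPattern, Pattern.DOTALL);
Matcher myMatcher = pattern.matcher(someHTML);
String result = null;
if (myMatcher.find()) { // groups are only available after a successful find()
    String lPreKeyword = myMatcher.group(3);
    String lPostKeyword = myMatcher.group(5);
    result = lPreKeyword + lKeyword + lPostKeyword;
}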

An example of this pattern in action can be found here. Like I said, parsing HTML via regex can get real ugly real fast.

Snekse
  • Ug, not having enough reputation yet to leave comments is a pain. This comment is actually for JacobM's answer. The capturing group of (.*?rtTop*?) should be changed to (.*?rtTop*?.*?) to account for the characters after the keyword rtTop – Snekse Sep 30 '10 at 20:16
  • Thanks so much @Snekse. The minor correction to the regex fixed it. Now, onto refreshing my regex knowledge :) – rs79 Sep 30 '10 at 20:47
  • Another follow-up question: in my regular expression, can I account for a case-insensitive match? For instance, even though the search term is "rtTop", I would like matches to be registered for "RTTOP", "rttop", etc. – rs79 Oct 01 '10 at 15:43
  • That's usually handled through a flag in the regEx engine. So replace "Pattern.DOTALL" with "Pattern.CASE_INSENSITIVE | Pattern.DOTALL". Note: This will make the entire expression case-insensitive. – Snekse Oct 01 '10 at 19:19
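
A minimal sketch of the flag combination from the last comment, assuming the script body contains no "</" sequence and that the keyword should be matched literally. The class and method names (CaseInsensitiveScriptSearch, findScript) are hypothetical. The body is captured as a single group here, so the result keeps the casing found in the page instead of the casing of the search term.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class CaseInsensitiveScriptSearch {

    // Returns the body of the first script element whose text contains keyword
    // in any letter case, or null when no script element contains it.
    static String findScript(String html, String keyword) {
        String regex = "<script[^>]*>((?:(?!</).)*" + Pattern.quote(keyword)
                + "(?:(?!</).)*)</script>";
        // Both flags apply to the whole expression, including the tag name.
        Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher matcher = pattern.matcher(html);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<script>var a = 1;</script>"
                + "<script type=\"text/javascript\">var x = RTTOP + 1;</script>";
        System.out.println(findScript(html, "rtTop"));
    }
}
```

Because the flag covers the entire expression, an upper-case <SCRIPT> tag would match as well, which is usually what you want when scanning arbitrary pages.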