1

I am trying to read a webpage and get the last-modified date from its meta tags. For example:

<head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<meta http-equiv="last-modified" content="Mon, 17 Sep 2012 13:57:35 SGT" />
</head>

I am reading the page line by line. How can I build the regex in this case? I am fairly new to regex. I have tried:

line.matches("<meta http-equiv=\"last-modified\" content=\"(\w)*\" /> "); 

but I do not think it is correct.

onegun
  • Usually people will say regex is not the right tool for HTML, but in this case, there should be no nested tags, so it should be fine. `/\/` something like that. – Orbling Sep 27 '12 at 17:11
  • b) [Parsing HTML is not a task for regexes](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – Martin Ender Sep 27 '12 at 17:12
  • Don't use regex as it is an overkill for this job. Just find the matching tag, and read after `content="` – nullpotent Sep 27 '12 at 17:12
  • line.matches(" "); – onegun Sep 27 '12 at 17:12
  • @Orbling the attributes could still be in a different order. And you can't be sure that the tags will always be on their own line. – Martin Ender Sep 27 '12 at 17:12
  • It's fine in this case to use regex, it's just plucking attributes out of a tag. – Orbling Sep 27 '12 at 17:12
  • @user1275129 please edit that into the question – Martin Ender Sep 27 '12 at 17:13
  • @m.buettner: The attributes being in alternate order can be handled if needed. The content might be in a regular layout anyhow. – Orbling Sep 27 '12 at 17:13
  • hi iccthedral, how can i do that... i am clueless atm – onegun Sep 27 '12 at 17:13
  • Sure... as can be lots of things in HTML. I'm usually the first one to check out whether a **specific** HTML problem can't be solved with regex. But still it's fair to note, that a 100% working version with regex isn't as easy as you'd think at first glance. – Martin Ender Sep 27 '12 at 17:14
  • @m.buettner: Aye, which is why I caveated my comment with "in this case", regex is frequently not the best candidate for this sort of task. But parsing a full DOM structure can be overkill too. – Orbling Sep 27 '12 at 17:15
  • @user1275129: the problem with your attempt is, that \w only matches `[a-zA-Z0-9_]` ... but you also need `,:` and whitespace – Martin Ender Sep 27 '12 at 17:16
  • i will try to adjust now – onegun Sep 27 '12 at 17:20
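
Following the point in the comments that `\w` does not cover `,`, `:` or whitespace, one way to repair the original attempt is to widen the character class. This is only a sketch: it assumes the tag appears on one line in exactly the layout shown in the question, and the names `LastModifiedLine` and `extract` are made up for illustration. Since `String.matches` only returns a boolean, a `Matcher` is used to read the captured group.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastModifiedLine {
    // \w alone rejects ",", ":" and spaces, so they are added to the character class.
    private static final Pattern META = Pattern.compile(
            "<meta http-equiv=\"last-modified\" content=\"([\\w,: ]*)\" />");

    // Returns the content value, or null when the line does not contain the tag.
    static String extract(String line) {
        Matcher m = META.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "<meta http-equiv=\"last-modified\" content=\"Mon, 17 Sep 2012 13:57:35 SGT\" />";
        System.out.println(extract(line)); // prints: Mon, 17 Sep 2012 13:57:35 SGT
    }
}
```

Note the doubled backslash in `\\w`: inside a Java string literal a single `\w` does not compile.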

3 Answers

1

While you should generally not use regex to parse HTML, if you insist on it, here's a regex option:

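import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Capture everything between the quotes of the content attribute.
Pattern metaPattern = Pattern.compile("<meta .*\"last-modified\" content=\"([^\"]*)\"");
Matcher metaMatch = metaPattern.matcher(sampleString);
// find() rather than matches(): the pattern does not have to span the whole input.
if (metaMatch.find())
{
    System.out.println(metaMatch.group(1));
}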
Tadgh
  • ty, i also found out this String metaRegex=""; final Pattern metapa = Pattern.compile(metaRegex, Pattern.DOTALL); final Matcher metamatch = metapa.matcher(html); String lastmodified=null; if (metamatch.find()) { lastmodified=metamatch.group(1); } – onegun Sep 27 '12 at 17:43
0

You can't use only \w for your group, since the target text also contains non-word characters (a comma, colons and spaces).

Try something like:

String line = "<meta http-equiv=\"last-modified\" content=\"Mon, 17 Sep 2012 13:57:35 SGT\" />";

Pattern p = Pattern.compile("<meta .*last-modified.*content=\"(.*)\".*");
Matcher m = p.matcher(line);
if (m.matches())
    System.out.println(m.group(1));

Output:

Mon, 17 Sep 2012 13:57:35 SGT
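
The greedy `(.*)` works on this line because the closing quote of `content` happens to be the last quote on it. If the tag carried another attribute after `content`, the group would run on to that later quote. A hedged variant is sketched below, assuming double-quoted attribute values and that `http-equiv` comes before `content`. It uses `[^\"]*` to stop at the first closing quote and `find()` so that it can scan a whole page instead of a single line. The names `MetaScan` and `lastModified`, and the extra `lang` attribute in the sample, are illustrative only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaScan {
    // [^>]*? keeps the match inside one tag; [^\"]* stops at the first closing quote.
    private static final Pattern LAST_MODIFIED = Pattern.compile(
            "<meta\\s+http-equiv=\"last-modified\"[^>]*?\\scontent=\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    // Returns the content value of the first matching tag, or null if there is none.
    static String lastModified(String html) {
        Matcher m = LAST_MODIFIED.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<head>\n"
                + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=windows-1252\">\n"
                + "<meta http-equiv=\"last-modified\" content=\"Mon, 17 Sep 2012 13:57:35 SGT\" lang=\"en\" />\n"
                + "</head>";
        System.out.println(lastModified(html)); // prints: Mon, 17 Sep 2012 13:57:35 SGT
    }
}
```

As noted in the comments on the question, attributes in a different order would still defeat this pattern; handling that needs a second alternative or a real HTML parser.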
pb2q
0

And here's a solution with no regex.

Of course, you would have to be careful using this and do some checks beforehand.

String data = "<head>" +  
              "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=windows-1252\">" +
              "<meta http-equiv=\"last-modified\" content=\"Mon, 17 Sep 2012 13:57:35 SGT\" />" + 
              "</head>";

String key =  "<meta http-equiv=\"last-modified\" content=\"";

int from = data.lastIndexOf(key);
String tag = data.substring(from + key.length());
int to = tag.indexOf("\"");
String date = tag.substring(0, to);
System.out.println(date);
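
The checks mentioned above might look like the following sketch. `lastIndexOf` and `indexOf` return -1 when nothing is found, so both results are tested before `substring` is called. It still assumes the tag is written with exactly the spelling and spacing of the key, and the names `MetaSubstring` and `extractLastModified` are hypothetical.

```java
class MetaSubstring {
    private static final String KEY = "<meta http-equiv=\"last-modified\" content=\"";

    // Returns the attribute value, or null if the tag or its closing quote is missing.
    static String extractLastModified(String data) {
        int from = data.lastIndexOf(KEY);
        if (from < 0) {
            return null; // tag not present
        }
        int start = from + KEY.length();
        int end = data.indexOf('"', start);
        if (end < 0) {
            return null; // unterminated attribute value
        }
        return data.substring(start, end);
    }

    public static void main(String[] args) {
        String data = "<head>"
                + "<meta http-equiv=\"last-modified\" content=\"Mon, 17 Sep 2012 13:57:35 SGT\" />"
                + "</head>";
        System.out.println(extractLastModified(data));            // prints: Mon, 17 Sep 2012 13:57:35 SGT
        System.out.println(extractLastModified("<head></head>")); // prints: null
    }
}
```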
nullpotent