regular expression to parse html title tag

Question

I need to parse a lot of HTML files to find out which ones contain specific text within the title tag.

Let's suppose that titles are

file1.htm
<title>100 text other text</title>
file2.htm
<title>text 100 text other text</title>
file3.htm
<title>text 1000 text other text</title>
file4.htm
<title>text one hundred text other text</title>

Following my example, I need to find the names of the files whose title contains 100 or one hundred, that is, files 1, 2 and 4.

My problem is that I don't know how to write the regular expression:

gci "c:\my_folder" | ? {$_.extension -eq ".htm"} | 
select-string -pattern '<title>*100*</title>' |
Select-Object -Unique Path

Please note, in case this matters for the regexp, that the title tag is not at the beginning of a line but in the middle. Thanks in advance.
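To see why the pattern in the question finds nothing, here is a minimal sketch of how a regex engine reads it. It uses Python's `re` module rather than PowerShell purely for illustration; the file names and titles are the ones from the question, and the variable names are made up for the example. In a regex, `*` is a quantifier that repeats the item before it, not a shell-style wildcard.

```python
import re

# The pattern from the question, unchanged. Here `>*` means "zero or more '>'"
# and `0*` means "zero or more '0'", so nothing in it stands for arbitrary text.
question_pattern = re.compile(r"<title>*100*</title>")

titles = {
    "file1.htm": "<title>100 text other text</title>",
    "file2.htm": "<title>text 100 text other text</title>",
    "file3.htm": "<title>text 1000 text other text</title>",
    "file4.htm": "<title>text one hundred text other text</title>",
}

# Collect the files whose title line matches the question's pattern.
matches = [name for name, line in titles.items() if question_pattern.search(line)]
print(matches)

# What the pattern does accept: "<title", any number of '>', "10",
# any number of '0', then "</title>" immediately afterwards.
odd_match = question_pattern.search("<title>10000</title>")
print(odd_match is not None)
```

None of the four sample titles match, while a title consisting only of `10000` does, which is the opposite of what was intended.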

Obligatory warning about parsing HTML using regular expressions: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Gareth McCaughan, Apr 12 '11 at 15:00
To expand a little on this point: the contents of the `title` element might extend over multiple lines; they might contain other HTML markup. HTML markup in general is unsuited to regular-expression processing because of its nested structure. Perhaps in this case you know exactly how your input files were made, and you know that the title is always on a single line and never contains anything difficult. **IF** that is so, then regular expressions may not be a crazy approach. Otherwise, you really ought to use a proper parser. — Gareth McCaughan, Apr 12 '11 at 15:03
Thanks for your replies. I've read the link, but if it's bad and I can't parse them, do I have some alternative? Edit: now I've read your last reply. — Nicola Cossu, Apr 12 '11 at 15:04
I don't know anything about PowerShell, but you should try replacing `*` with `.*` in the pattern. `.` matches any character. `*` means "any number of things matching the thing I just said". So `.*` means "any number of arbitrary characters". But, I repeat, please consider carefully whether regular expressions are really an appropriate tool for this job. — Gareth McCaughan, Apr 12 '11 at 15:05
Thanks again Gareth for your detailed answers. I'll google for some alternative if powershell is not the right tool for this job. — Nicola Cossu, Apr 12 '11 at 15:07
Regexes are _fun!_ But I would strongly recommend spending an hour or two studying the basics. There is an excellent online tutorial at: [www.regular-expressions.info](http://www.regular-expressions.info/). The time you spend there will pay for itself many times over. Happy regexing! — ridgerunner, Apr 12 '11 at 16:04
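As a rough illustration of the "proper parser" route suggested in the comments above, here is a sketch using Python's standard-library `html.parser`. This is a swapped-in technique, not something anyone in the thread posted; the class name `TitleExtractor`, the helper `extract_title`, and the sample markup are all hypothetical. The parser pulls out the title text first, so the regex only has to deal with plain text, even when the title spans several lines.

```python
from html.parser import HTMLParser
import re


class TitleExtractor(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.parts.append(data)


def extract_title(html_text):
    parser = TitleExtractor()
    parser.feed(html_text)
    parser.close()
    # Collapse whitespace so a title split over several lines becomes one line.
    return " ".join("".join(parser.parts).split())


# Whole-word "100" or "one hundred"; \b keeps "1000" from matching.
wanted = re.compile(r"\b(100|one\s+hundred)\b")

# Hypothetical sample: the title is split over two lines.
sample = "<html><head>\n<title>text\n  100 text other text</title></head></html>"
title = extract_title(sample)
print(title, bool(wanted.search(title)))
```

Whether this is worth the extra code depends on how uniform the input files are; for files known to keep the title on one line, the plain regex approach discussed below may be enough.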

jimplode · Accepted Answer · 2011-04-12T15:09:48.457

2

This should do it.

^.*<title>(.*(100|one\shundred)[^0].*)?</title>.*$

edited Apr 12 '11 at 15:09

answered Apr 12 '11 at 14:59


May I ask what `[^0]` means? – Nicola Cossu Apr 12 '11 at 15:21
It doesn't allow 1000, which is 100 followed by 0. – user unknown Apr 12 '11 at 15:24
@nick rulez, as the comment above says, this stops it from matching 1000: it says do not allow 0 as the next character. – jimplode Apr 12 '11 at 15:27
Ah, ok. Now I've understood. Thanks again for your kindness. :) – Nicola Cossu Apr 12 '11 at 15:32
This solution (`^.*<title>(.*(100|one\shundred)[^0].*)?</title>.*$`) is overusing the greedy-dot-star! (It is doing a _LOT_ of unnecessary work, especially when testing long files which don't match.) And there is no need to match anything before or after the TITLE element. A better (and much faster) expression would be: `<title>[^<]*?\b(100|one\s+hundred)\b[^<]*</title>`. – ridgerunner Apr 12 '11 at 16:16
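To make the difference between the two patterns discussed above concrete, here is a small comparison sketch in Python (used only for illustration; the constructs involved behave the same way in most regex flavours). It assumes the word-boundary pattern from the last comment is meant to be wrapped in `<title>` and `</title>`. The first four sample lines come from the question; the last three are hypothetical edge cases added for the example.

```python
import re

# The accepted answer's pattern, as posted.
accepted = re.compile(r"^.*<title>(.*(100|one\shundred)[^0].*)?</title>.*$")

# The word-boundary variant, assuming it is wrapped in the title tags.
bounded = re.compile(r"<title>[^<]*?\b(100|one\s+hundred)\b[^<]*</title>")

samples = [
    "<title>100 text other text</title>",
    "<title>text 100 text other text</title>",
    "<title>text 1000 text other text</title>",
    "<title>text one hundred text other text</title>",
    # Hypothetical edge cases, not from the question:
    "<title>other text 100</title>",            # "100" is the last thing in the title
    "<title>text 2100 text</title>",            # "100" inside a longer number
    "<title>text one hundredth text</title>",   # "hundred" inside a longer word
]

# Map each sample to (accepted matches?, bounded matches?).
results = {s: (bool(accepted.search(s)), bool(bounded.search(s))) for s in samples}
for s, (a, b) in results.items():
    print(f"accepted={a!s:5} bounded={b!s:5} {s}")
```

Both patterns agree on the four titles from the question. On the extra cases, `[^0]` requires some character other than `0` after the number, so the accepted pattern misses a title that ends in `100` and accepts `2100` and `one hundredth`, while the `\b` version handles all three as whole-word matches.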

collapsar · Answer 2 · 2011-04-12T15:19:18.837

1

try

<title>(.*[^[:alnum:]])?(100|one hundred)([^[:alnum:]].*)?</title>

as the pattern to match. The pattern syntax is PCRE (as in Perl); it can be reformulated if necessary.

best regards,

carsten

ps: beware of the pitfalls - all the recommendations and warnings from the comments do hold; still, in your case, the regex approach seems viable (mainly because you're investigating the 'title' tag's content, there should only be a single one per file and spreading it across multiple lines would be plain silly imho).
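On the "it can be reformulated" remark above: POSIX bracket classes such as `[[:alnum:]]` are a PCRE/Perl feature, and as far as I know neither Python's `re` nor the .NET engine that PowerShell uses implements them. One possible reformulation, assuming ASCII letters and digits are what `[:alnum:]` was meant to cover, spells the class out explicitly. The sketch below is in Python for illustration only; the last three sample lines are hypothetical additions to the four titles from the question.

```python
import re

# [^[:alnum:]] rewritten as [^A-Za-z0-9] (ASCII-only, an assumption).
portable = re.compile(
    r"<title>(.*[^A-Za-z0-9])?(100|one hundred)([^A-Za-z0-9].*)?</title>"
)

samples = [
    "<title>100 text other text</title>",
    "<title>text 100 text other text</title>",
    "<title>text 1000 text other text</title>",
    "<title>text one hundred text other text</title>",
    # Hypothetical edge cases, not from the question:
    "<title>other text 100</title>",
    "<title>text 2100 text</title>",
    "<title>text one hundredth text</title>",
]

# True where the title contains "100" or "one hundred" as a separate token.
hits = [bool(portable.search(s)) for s in samples]
for s, hit in zip(samples, hits):
    print(hit, s)
```

Because both neighbouring groups are optional, the number may sit at the very start or very end of the title, and the explicit non-alphanumeric neighbours rule out `1000`, `2100` and `one hundredth`.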

edited Apr 12 '11 at 15:19

answered Apr 12 '11 at 15:09


Thanks collapsar. Your solution seems perfect too. I gave you an upvote as well, but I've accepted jimplode's answer because he replied first. Thanks again. You're a genius. I'm afraid I'll never learn these regexps. :( – Nicola Cossu Apr 12 '11 at 15:16
thanks. be warned however that the solution you've accepted would match 'one hundredth' too, which might not be what you want. greetz, carsten – collapsar Apr 12 '11 at 15:21
Thanks for the warning. As you have seen, I'm a total newbie with regexps, so I'm not able to catch those little details. :) I won't have the "hundredth" problem because my native language is Italian. I put the problem in English terms so that everybody could understand it. I needed to parse Italian strings ;) Thanks again. – Nicola Cossu Apr 12 '11 at 15:27
