I need to parse a lot of HTML files to find out which ones contain specific text within the title tag.

Let's suppose that titles are

file1.htm
<title>100 text other text</title>
file2.htm
<title>text 100 text other text</title>
file3.htm
<title>text 1000 text other text</title>
file4.htm
<title>text one hundred text other text</title>

In this example I need to find the names of the files whose title contains 100 or one hundred, that is, files 1, 2 and 4.

My problem is that I don't know how to write the regular expression. This is what I have so far:

gci "c:\my_folder" | ? {$_.extension -eq ".htm"} | 
select-string -pattern '<title>*100*</title>' |
Select-Object -Unique Path

Please note, in case it matters for the regexp, that the title tag is not at the beginning of a line but in the middle. Thanks in advance.

Nicola Cossu
  • normally, using regex to parse HTML is bad. just FYI. – Muad'Dib Apr 12 '11 at 14:59
  • Obligatory warning about parsing HTML using regular expressions: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Gareth McCaughan Apr 12 '11 at 15:00
  • To expand a little on this point: the contents of the `title` element might extend over multiple lines; they might contain other HTML markup. HTML markup in general is unsuited to regular-expression processing because of its nested structure. Perhaps in this case you know exactly how your input files were made, and you know that the title is always on a single line and never contains anything difficult. **IF** that is so, then regular expressions may not be a crazy approach. Otherwise, you really ought to use a proper parser. – Gareth McCaughan Apr 12 '11 at 15:03
  • Thanks for your replies. I've read the link, but if regexes are a bad fit and I can't parse the files that way, do I have an alternative? Edit: I've now read your last reply. – Nicola Cossu Apr 12 '11 at 15:04
  • I don't know anything about PowerShell, but you should try replacing `*` with `.*` in the pattern. `.` matches any character. `*` means "any number of things matching the thing I just said". So `.*` means "any number of arbitrary characters". But, I repeat, please consider carefully whether regular expressions are really an appropriate tool for this job. – Gareth McCaughan Apr 12 '11 at 15:05
  • Thanks again Gareth for your detailed answers. I'll google for some alternative if powershell is not the right tool for this job. – Nicola Cossu Apr 12 '11 at 15:07
  • Regexes are _fun!_ But I would strongly recommend spending an hour or two studying the basics. There is an excellent online tutorial at: [www.regular-expressions.info](http://www.regular-expressions.info/). The time you spend there will pay for itself many times over. Happy regexing! – ridgerunner Apr 12 '11 at 16:04
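
For anyone looking for the "proper parser" route mentioned in the comments, here is a minimal sketch in Python rather than PowerShell, using only the standard library's `html.parser`. The names `TitleExtractor`, `title_of` and `title_matches` are made up for illustration, and the word-boundary pattern is an assumption about what "contains 100 or one hundred" should mean. Because the parser collects the title text first, a title that spans several lines is handled the same way as a single-line one.

```python
import re
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collects the text inside the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased; only the first title is considered.
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def title_of(html):
    """Return the title text with runs of whitespace collapsed to one space."""
    parser = TitleExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


# Assumption: "100" and "one hundred" must appear as whole words.
WANTED = re.compile(r"\b(100|one\s+hundred)\b")


def title_matches(html):
    """True if the document's title contains 100 or one hundred."""
    return WANTED.search(title_of(html)) is not None
```

To scan a folder, one option is to loop over `pathlib.Path(folder).glob("*.htm")`, read each file, and keep the paths for which `title_matches` returns true.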

2 Answers

This should do it.

^.*<title>(.*(100|one\shundred)[^0].*)?</title>.*$
jimplode
  • May I ask what [^0] means? – Nicola Cossu Apr 12 '11 at 15:21
  • It rules out 1000, which is 100 followed by 0. – user unknown Apr 12 '11 at 15:24
  • @nick rulez, as the comment above says, this stops it from matching 1000: it says "do not allow 0 as the next character". – jimplode Apr 12 '11 at 15:27
  • Ah, ok. Now I understand. Thanks again for your kindness. :) – Nicola Cossu Apr 12 '11 at 15:32
  • This solution (`^.*<title>(.*(100|one\shundred)[^0].*)?</title>.*$`) overuses the greedy dot-star! (It does a _LOT_ of unnecessary work, especially when testing long files which don't match.) And there is no need to match anything before or after the TITLE element. A better (and much faster) expression would be: `<title>[^<]*?\b(100|one\s+hundred)\b[^<]*</title>`. – ridgerunner Apr 12 '11 at 16:16
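
To see how the patterns discussed so far behave, here is a small check, written in Python rather than PowerShell (for these constructs Python's `re` should behave the same way). It runs the pattern from the question, the accepted pattern, and the pattern from the last comment against the four sample titles. Two things are assumptions: the comment's pattern is taken to be wrapped in `<title>...</title>`, and the two `extra` titles are invented edge cases that are not in the question.

```python
import re

titles = {
    "file1.htm": "<title>100 text other text</title>",
    "file2.htm": "<title>text 100 text other text</title>",
    "file3.htm": "<title>text 1000 text other text</title>",
    "file4.htm": "<title>text one hundred text other text</title>",
    # Invented edge cases, not part of the question:
    "extra1.htm": "<title>text 100</title>",            # 100 is the last word
    "extra2.htm": "<title>one hundredth text</title>",  # longer word
}

patterns = {
    # As a regex, ">*" means "zero or more >" and "0*" means "zero or more 0".
    "question": r"<title>*100*</title>",
    # [^0] must consume one character after the number.
    "accepted": r"^.*<title>(.*(100|one\shundred)[^0].*)?</title>.*$",
    # \b is a zero-width word boundary, so nothing extra is consumed.
    "comment": r"<title>[^<]*?\b(100|one\s+hundred)\b[^<]*</title>",
}


def matching_files(pattern):
    """Return the sorted names of the sample files whose line matches."""
    rx = re.compile(pattern)
    return sorted(name for name, line in titles.items() if rx.search(line))


for label, pattern in patterns.items():
    print(label, matching_files(pattern))
```

On this sample the question's pattern matches nothing, while the other two both find files 1, 2 and 4. They differ only on the invented cases: the accepted pattern misses a title that ends in 100 and accepts "hundredth", and the word-boundary version does the opposite.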

try

<title>(.*[^[:alnum:]])?(100|one hundred)([^[:alnum:]].*)?</title>

as the pattern to match. The syntax is PCRE (as in Perl); it can be reformulated if necessary.

best regards,

carsten

ps: beware of the pitfalls - all the recommendations and warnings from the comments do hold; still, in your case, the regex approach seems viable (mainly because you're investigating the 'title' tag's content, there should only be a single one per file and spreading it across multiple lines would be plain silly imho).

collapsar
  • Thanks collapsar. Your solution seems perfect too. I gave you an upvote as well, but I've accepted jimplode's answer because he replied first. Thanks again. You're a genius. I'm afraid I'll never learn these regexps. :( – Nicola Cossu Apr 12 '11 at 15:16
  • thanks. be warned however that the solution you've accepted would match 'one hundredth' too, which might not be what you want. greetz, carsten – collapsar Apr 12 '11 at 15:21
  • Thanks for the warning. As you have seen I'm a total newbie with regexps, so I'm not able to catch those small details. :) I won't have the "hundredth" problem because my native language is Italian. I put the problem in English terms so that everybody could understand it. I actually need to parse Italian strings ;) Thanks again. – Nicola Cossu Apr 12 '11 at 15:27
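
The answer notes that its PCRE pattern can be reformulated. As a sketch of one such reformulation, the version below replaces the POSIX class `[^[:alnum:]]` with the explicit set `[^A-Za-z0-9]`, since Python's `re` module does not support POSIX bracket classes. The explicit set is an ASCII-only assumption, and `title_has_number` is a made-up helper name.

```python
import re

# Reformulation of the answer's pattern with an explicit ASCII character set.
PATTERN = re.compile(
    r"<title>(.*[^A-Za-z0-9])?(100|one hundred)([^A-Za-z0-9].*)?</title>"
)


def title_has_number(line):
    """True if the title on this line contains 100 or one hundred as a whole word."""
    return PATTERN.search(line) is not None


samples = [
    "<title>100 text other text</title>",
    "<title>text 1000 text other text</title>",
    "<title>text one hundred text other text</title>",
    "<title>one hundredth text</title>",
    "<title>text 100</title>",
]
for line in samples:
    print(title_has_number(line), line)
```

Because the groups before and after the number are optional, the number may sit at the very start or end of the title, and "1000" and "hundredth" are rejected. With the ASCII-only set, accented letters count as separators, which may matter for Italian text.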