
I am trying to batch process (search and replace) a couple hundred thousand HTML pages with regex in Notepad++. All the HTML pages have the exact same layout, and I am basically trying to copy an element (a title) into the page's title tag, which isn't currently empty:

<html>
<head>
<title>some title</title>
<lots of junk and newlines>
</head>
<body>
<lots of stuff, tags, content><span>stuff</span><div>more stuff</div>
<div id="uniqueID">
<span>The Title that should be copied into head's title tag</span>
</div>
...other stuff...</body>

I can find:

The title tag: <title>(.*?)</title>
And the span containing the REAL title: 
(\s*<div id="uniqueID">\s*)<span>(.*)</span>(\s*</div>)

But I can't seem to be able to fit them into one expression (ignoring the junk in between) to be able to search and replace it in Notepad++.

The uniqueID div is identical on every page (spaces and newlines included), and it contains nothing other than the span and its content. The title tag obviously appears only once per page. I have just started with regular expressions, and the possibilities are endless. I know regex isn't ideal for parsing HTML, but for this case it should be enough. Does anyone know how to patch these two expressions together so that they ignore the in-between content?

Thank you so much!

  • Rule 1: don't use RegEx to parse HTML. Rule 2: if you still want to parse HTML with RegEx, see rule 1. [RegEx can only match regular languages, and HTML is not a regular language](http://stackoverflow.com/a/590789/930393) – freefaller Apr 01 '14 at 16:13
  • @freefaller: The OP seems to have already considered this. – J0e3gan Apr 01 '14 at 16:22
  • Try adding a `?` after the asterisks in the latter regex, like you did in the title regex. – Jeff Apr 01 '14 at 16:32
  • Thanks everyone for helping, I know the rules but I would only use it in stuff that's been generated by me and that I know have a constant coding. Love you guys and stay away from clients who cheap out on a real server and want non dynamic websites with over 10 000 pages! arrghh – user3485840 Apr 01 '14 at 17:58

1 Answer


You can use the following in Notepad++'s Replace dialog to copy the title in the span to the title tag...

  • Find what : <title>(.*)</title>(.*<div id="uniqueID">\s*<span>([A-Za-z ']*)</span>\s*</div>)
  • Replace with : <title>$3</title>$2

...if you select Regular expression and check . matches newlin in the dialog (yes, "newlin" rather than "newline" - at least in the version of Notepad++ on the machine I am using). By using $2 and $3 you are leveraging backreferences to groups' captured values.
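For anyone scripting this outside Notepad++, the same find/replace can be sketched in Python. This is only a rough sketch and not part of the steps above: `re.DOTALL` stands in for the ". matches newline" checkbox, Python writes the backreferences as `\3` and `\2` rather than `$3` and `$2`, and the function name `copy_span_to_title` and the sample page are made up for illustration. Python's `re` module and Notepad++'s regex engine are not identical, so treat it as an approximation.

```python
import re

# Same pattern as in the Find what box; re.DOTALL lets "." match newlines.
PATTERN = re.compile(
    r'<title>(.*)</title>'
    r'(.*<div id="uniqueID">\s*<span>([A-Za-z \']*)</span>\s*</div>)',
    re.DOTALL,
)


def copy_span_to_title(html: str) -> str:
    """Replace the <title> text with the text of the span in the uniqueID div.

    Hypothetical helper name; group 3 is the span text, group 2 is everything
    between </title> and the end of the uniqueID div, which is put back as-is.
    """
    return PATTERN.sub(r'<title>\3</title>\2', html, count=1)


# Illustrative page modelled on the layout in the question.
page = """<html>
<head>
<title>some title</title>
</head>
<body>
<div id="uniqueID">
<span>The Title that should be copied into head's title tag</span>
</div>
...other stuff...</body>
"""

result = copy_span_to_title(page)
print(result)
```

The span itself is left in place; only the text between the title tags changes.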

A less constrained pattern to match the spans with the titles runs the risk of grabbing spans later in the files - for example:

<html>
<head>
<title>some title</title>
<lots of junk and newlines>
</head>
<body>
<lots of stuff, tags, content><span>stuff</span><div>more stuff</div>
<div id="uniqueID">
<span>The Title that should be copied into head's title tag</span>
</div>
<div>
<span>The text that should not be copied into the head's title tag but will be</span>
</div>
...other stuff...</body>

If the titles to copy from the spans contain characters other than uppercase and lowercase letters, spaces, and apostrophes, then you can add to the character class [A-Za-z '] as needed (e.g. [A-Za-z0-9 '_] to include digits and underscores). Just watch out for HTML markup characters themselves - e.g. < and >.
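To see the difference the character class makes, here is a small Python sketch that runs the same replacement with three span patterns against a page that has a second div/span after the uniqueID one. The helper names `retitle` and `new_title` and the sample page are made up for illustration, and the negated class `[^<>]*` is my own suggestion for "anything except markup characters" rather than something from the pattern above. Python's `re` is used as a stand-in for Notepad++'s engine, so behaviour there may differ in details.

```python
import re

# Illustrative page: a second div/span follows the uniqueID div.
PAGE = """<html>
<head>
<title>some title</title>
</head>
<body>
<div id="uniqueID">
<span>The Real Title</span>
</div>
<div>
<span>Text that should not be copied</span>
</div>
</body>
"""


def retitle(span_pattern: str, html: str) -> str:
    """Run the title replacement with a given pattern for the span text."""
    regex = re.compile(
        r'<title>(.*)</title>(.*<div id="uniqueID">\s*<span>('
        + span_pattern
        + r')</span>\s*</div>)',
        re.DOTALL,  # equivalent of ". matches newline"
    )
    return regex.sub(r'<title>\3</title>\2', html, count=1)


def new_title(html: str) -> str:
    """Return whatever ended up between the first pair of title tags."""
    return re.search(r'<title>(.*?)</title>', html, re.DOTALL).group(1)


# Unconstrained: the greedy ".*" runs on to the LAST </span></div> pair.
loose = new_title(retitle(r'.*', PAGE))
# Constrained as in the answer: letters, spaces, apostrophes only.
strict = new_title(retitle(r"[A-Za-z ']*", PAGE))
# Assumed alternative: anything except the markup characters < and >.
negated = new_title(retitle(r'[^<>]*', PAGE))

print(repr(loose))
print(repr(strict))
print(repr(negated))
```

With the unconstrained pattern the new title swallows markup and the text of the later span, while both constrained patterns stop at the end of the intended span.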

J0e3gan