Regular Expression, only replace first occurrence of HTML tag

Question

I've got several files that contain two <body> tags (either on purpose or by accident). I want to find only the first occurrence of the <body> tag and append additional HTML code after it, leaving the second occurrence untouched. I'm using TextWrangler. The regex I'm using now replaces both occurrences rather than just the first.

Text:

<body someattribute=...>
existing content
<body onUnload=...>

RegEx I'm using:

Find: (\<body.*\>)

Replace with: 

\n\1
appended HTML code

Current result:

<body someattribute=...>
appended HTML code
existing content
<body onUnload=...>
appended HTML code

So it's adding my appended code twice. I just want it to happen to the first <body...> only.

[Don't use regex's with HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — erip, Dec 17 '15 at 17:34
You should be more specific with your regex, if the two tags aren't 100% identical. If they are, and your text editor (I've never used that one) doesn't have an option to only replace one instance at a time, there's not a whole lot you can do without a much longer and more complicated regex. — Kendra, Dec 17 '15 at 17:38
I think this question is more about your text editor, rather than regular expressions. At the moment I can't think of any regex feature that would help here (except including `someattribute` in your regex, but I think this is not an option) — Andrea Corbellini, Dec 17 '15 at 17:46
@AndreaCorbellini Oh, it's possible to do this with regex. For instance, by using capture groups, capturing the "existing content" and checking down to the next tag, then using a regex to add the new text while keeping the old. But it's much more complex than just using text editor options (there should be one for "replace one" instead of all replaces being "replace all", after all), for sure. — Kendra, Dec 17 '15 at 17:48
@Kendra the two tags are almost never identical. Unfortunately, I have thousands and thousands of HTML files that I need to go through to fix a multitude of issues (this isn't the only regex I'm trying to use). So I'm running multiple regexes for different situations. — scotthorvath, Dec 17 '15 at 18:38
@erip, you cannot parse HTML with regexps, but the grammar for an open tag is only a subset of HTML, and that sublanguage **is regular** and so **can be parsed** with a regexp. For example, https://regex101.com/r/fQ0dE0/1 shows a demo with `<([^<>"']|"[^"]*"|'[^']*')*\/?>` that will recognise all well-formed opening, closing and empty tags in an _xml_ document, with one (and only one) match per tag. — Luis Colorado, Dec 18 '15 at 12:20
Due to the nature of HTML, a construct like `...
...` is actually correct and will start the body at the location of the `
`. Because the body's start tag is optional! So you get a `` inserted automatically. Therefore, the place where the body starts can NOT be determined with a regex that searches for ` — Mr Lister, Dec 21 '15 at 22:27
By the way, it's perfectly valid to have the text " element. — Mr Lister, Dec 21 '15 at 22:31
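
The tag-matching pattern quoted in Luis Colorado's comment can also be tried outside regex101. The following is only a small sketch in Python, not something from the thread: the sample string and the names `TAG`, `sample` and `tags` are made up for illustration, the group has been made non-capturing, and the `\/` is written as a plain `/` because Python does not need the escape.

```python
import re

# Pattern from the comment above, lightly adapted (assumptions: non-capturing
# group, unescaped "/"). It treats quoted attribute values as units, so a ">"
# inside quotes does not end the tag early.
TAG = re.compile(r"""<(?:[^<>"']|"[^"]*"|'[^']*')*/?>""")

# Hypothetical sample: an opening tag with ">" inside an attribute value,
# a closing tag, and an empty tag.
sample = '<body onUnload="a>b">text</body><br/>'

# One match per tag, in document order.
tags = [m.group(0) for m in TAG.finditer(sample)]
print(tags)
```

Note that this only shows that individual tags can be tokenised; it does not address the point made in the later comments that the body's start tag is optional in HTML.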

tekim · Accepted Answer · 2015-12-17T20:40:47.890

3

Regex:

(?s)(<body.*?>)(.*)

Replace:

\1\nappended content\n\2

Explanation:

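(?s) makes the . character match new lines. Without this, the . character matches any character except a new line, so a match can never extend past the end of a line.
(<body.*?>) Finds the first "body" tag, stopping at the nearest > because .*? is lazy, and captures it as group 1 (\1).
(.*) Matches everything after the first "body" tag through the end of the file, and captures it as group 2 (\2). Because this single match consumes the rest of the file, there is nothing left for a second match, so the later "body" tag is not touched.
Replaces everything that was found with group 1 + new line + appended content + new line + group 2.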

Tested in Notepad++
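
For anyone who would rather script this across many files than run it in an editor, here is a minimal Python sketch of the same idea. It is an illustration under assumptions, not part of the tested answer: the sample text is the one from the question, the variable names are invented, and the question's replacement string is simplified. The last variant uses the `count=1` argument of Python's `re.sub`, which is a different technique from the answer's regex and is specific to Python.

```python
import re

# Sample text from the question (attribute values elided there as "...").
html = (
    "<body someattribute=...>\n"
    "existing content\n"
    "<body onUnload=...>\n"
)

# The question's pattern: without (?s), "." stops at each line end, so a
# replace-all finds one match per <body ...> line and appends twice.
both = re.sub(r"(<body.*>)", r"\1\nappended HTML code", html)

# The answer's pattern: group 2 swallows the rest of the text, so a
# replace-all has nothing left to match after the first tag.
first_only = re.sub(r"(?s)(<body.*?>)(.*)", r"\1\nappended content\n\2", html)

# Alternative (an assumption, not from the answer): limit re.sub to a single
# replacement with count=1 and match just the tag itself.
counted = re.sub(r"<body[^>]*>", r"\g<0>\nappended content", html, count=1)

print(both)
print(first_only)
print(counted)
```

One difference worth knowing: the answer's replacement adds its own new line before group 2, and group 2 already begins with the new line that followed the first tag, so `first_only` contains a blank line after the appended content, while `counted` does not.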

edited Dec 17 '15 at 20:40

answered Dec 17 '15 at 17:57


Thanks for this. I tried it out, but it added the "appended text" below the first tag and the second tag. I can see how it should do what I was hoping for, but it didn't. – scotthorvath Dec 17 '15 at 18:29
@scotthorvath If the regex in this answer is still appending below the second tag as well, it might be that your text editor isn't playing very nicely with regex. Actually, it could be that you don't have the option set for `.` to match new lines. Is there an option for this in your text editor? (If there isn't, it might take regular regex arguments, but I unfortunately don't know any of those offhand.) – Kendra Dec 17 '15 at 18:41
@scotthorvath - What application / editor are you using? – tekim Dec 17 '15 at 18:42
@tekim It says in the question they're using TextWrangler. – Kendra Dec 17 '15 at 18:43
@scotthorvath Try adding `(?s)` to the start of this regex. Ex `(?s)(<body.*?>)(.*)` as this will make `.` match new lines, and therefore make this regex work. – Kendra Dec 17 '15 at 18:47
@tekim That did it!! Thank you! And thanks as well Kendra. – scotthorvath Dec 17 '15 at 18:57
@Kendra - I have edited my answer to reflect your (?s) suggestion after testing. – tekim Dec 17 '15 at 18:57
@Kendra I have added an explanation of the `(?s)` option to my answer, as well as cleaned up the formatting. Thanks for the suggestion. – tekim Dec 17 '15 at 20:44
