RegEx: Get content from multiple concatenated HTML-Files

Question

I have a bunch of HTML files that I concatenate, and I want to extract only the actual contents. However, I'm having trouble finding the correct regex for that. Basically, I'm trying to remove everything before, in between, and after certain boundaries. It's somewhat similar to "Regular expression to match a line that doesn't contain a word?", but it feels more complex to me, and I'm having no luck.

Source-Data:

Stuff I dont need before

<div id="start">
blablabla11
blablabla12
<div id="end">

Stuff I dont need in the middle1

<div id="start">
blablabla21
blablabla22
<div id="end">

Stuff I dont need in the middle2

<div id="start">
blablabla31
blablabla32
<div id="end">

Stuff I dont need in the end

Desired result:

<div id="start">
blablabla11
blablabla12
<div id="end">

<div id="start">
blablabla21
blablabla22
<div id="end">

<div id="start">
blablabla31
blablabla32
<div id="end">

Context: I'm working in Sublime (Mac) -> Perl Regex

My current approach is based on inverse matching / regex lookarounds (I know there is lots of discussion about wording, methods, ugliness, etc. around this topic, but I can't afford to care, as I need to get the job done):

Find: (?s)^((?!(<div id="start">)(?s)(.*?)(<div id="end">)).)*$
Replace: $3

And many more variants; I've been testing and playing around. However, it yields:

blablabla11
blablabla12

<div id="start">

blablabla21
blablabla22

<div id="start">

blablabla31
blablabla32

<div id="start">

Nice, but not there yet. And whatever I try, I stumble into other problems. Noob at work, I guess.

Thanks a gazillion for your help guys!

Chris

EDIT: Thank you for the first answers! However, I must admit that my minimal example is a bit misleading (because it is too easy). In reality I am facing hundreds of complex and diverse HTML files concatenated into one single large file. The only common bits are that the content of every HTML file starts with a known string (here simplified as <div id="start">) and ends with a known string (here simplified as <div id="end">). The content as such obviously has loads of different tags etc., so just testing for opening and closing tags sadly won't cut it.

Try `(?s).*?(<div id="start">.*?<div id="end">)(?:(?:(?!<div id="start">).)*$)?` and replace with `$1\n\n`. See [demo](https://regex101.com/r/NPkRqD/1). — Wiktor Stribiżew, Nov 14 '18 at 19:52
Not sure what the div element is but you can't just use something like `
.*? — , Nov 14 '18 at 20:23
What language or regex engine are you using ? I might be able to give you a template regex that you can put stuff into. — , Nov 14 '18 at 20:52
@WiktorStribiżew - you are awesome, man! At first I overlooked your comment, but it works perfectly, even on my real-life files. PERFECT, how cool is that! Would you be so kind as to help me understand how `(?:(?:(?!<div id="start">).)*$)?` works, so one day I can write regex myself instead of C&P? ;)) — Wirsing, Nov 14 '18 at 23:00

score 1 · Accepted Answer · answered Nov 14 '18 at 23:10

You may look for

(?s).*?(<div id="start">.*?<div id="end">)(?:(?:(?!<div id="start">).)*$)?

and replace with $1\n\n. See regex demo.
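For anyone who wants to try the substitution outside the editor, here is a minimal sketch in Python. It assumes Python's `re` module as a stand-in for Sublime's engine (the constructs used here are supported by both), and `keep_blocks` is a hypothetical helper name, not something from the question. The boundary strings are the simplified markers from the question and would need to be swapped for the real ones.

```python
import re

# Pattern from the answer. The two boundary strings are the simplified
# markers from the question, not the real-life ones.
PATTERN = re.compile(
    r'(?s).*?(<div id="start">.*?<div id="end">)'
    r'(?:(?:(?!<div id="start">).)*$)?'
)


def keep_blocks(text: str) -> str:
    """Hypothetical helper: replace every match with group 1 plus a blank line."""
    # Equivalent to replacing with $1\n\n in the editor.
    return PATTERN.sub(lambda m: m.group(1) + "\n\n", text)


# Shortened version of the sample data from the question (two blocks).
sample = (
    'Stuff I dont need before\n\n'
    '<div id="start">\nblablabla11\nblablabla12\n<div id="end">\n\n'
    'Stuff I dont need in the middle1\n\n'
    '<div id="start">\nblablabla21\nblablabla22\n<div id="end">\n\n'
    'Stuff I dont need in the end\n'
)

result = keep_blocks(sample)
print(result)
```

Each match swallows the junk in front of a block together with the block itself, and the last match also swallows the trailing junk, so only the captured blocks survive the replacement.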

Details

(?s) - DOTALL modifier, . now matches any char
.*? - any 0+ chars, as few as possible
(<div id="start">.*?<div id="end">) - Group 1: <div id="start">, any 0+ chars as few as possible, and <div id="end">
(?:(?:(?!<div id="start">).)*$)? - an optional non-capturing group matching 1 or 0 occurrence of
- (?:(?!<div id="start">).)* - any char, 0 or more occurrences, that does not start a <div id="start"> char sequence (aka tempered greedy token)
- $ - end of string.
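To see the tempered greedy token from the list above in isolation, here is a small sketch in Python (an assumption on my part, since the question is about Sublime; the lookahead behaves the same way here). The test strings are made up for illustration.

```python
import re

# Tempered greedy token: consume a character only if the start marker
# does not begin at the current position.
TEMPERED = re.compile(r'(?s)(?:(?!<div id="start">).)*')
# Same token, but it must reach the end of the string to succeed.
TEMPERED_TO_END = re.compile(r'(?s)(?:(?!<div id="start">).)*$')

# Made-up input with another block still ahead.
tail = 'trailing junk\n<div id="start">next block'
# Made-up input with no further block.
no_marker = 'trailing junk only\n'

# The token stops right before the next start marker.
stopped = TEMPERED.match(tail).group(0)            # 'trailing junk\n'

# Anchored with $, it fails when a start marker is still ahead...
blocked = TEMPERED_TO_END.match(tail)               # None

# ...and consumes everything when no start marker follows.
whole = TEMPERED_TO_END.match(no_marker).group(0)   # 'trailing junk only\n'
```

This is why the group has to be optional in the full pattern: between two blocks it cannot reach the end of the string, so it matches nothing, and only after the last block does it eat the remaining text.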

Forgot to answer here to give you credit: Thank you so much, @Wiktor. You are a real asset to the regex community! — Wirsing, Nov 26 '18 at 22:52
