How to grab a section with Regex?

Question

I'd like to grab the various sections in my code with Regular expressions. I want to write four different regex expressions. The first one is simple, which is to grab the first line that begins with the word extends. The next three need to grab the sections denoted block head, block body, and block scripts.

I'm a bit lost. So far I've got /^block/m

I'm not looking to respect indentation, just using it for my own visual organization.

extends standard

block head

  <title>title</title>
  <meta name="description" content="A wonderful thing.">

block body

  <h1>Title</h1>
  <p>A wonderful paragraph...</p>

block scripts

  <script src="/javascritps/html5shiv.js"></script>

I need to be able to grab the identifier after the word block.

Also, separately, I need to grab the HTML content after each block ____ statement.

What will you do if the HTML contains "block"? You will need a HTML parser. You can't parse HTML with regex. — Oriol, Dec 27 '16 at 02:01
Well, block will not appear after a newline. I'm open to suggestions on parsers to use. — Costa Michailidis, Dec 27 '16 at 02:04

antoni · Accepted Answer · 2016-12-27T02:23:43.127

4

You have a good start. Here is how to do it using a lookbehind: /(?<=^block )\w+\n/mg

See it in action here: https://regex101.com/r/bFhNSO/1

[EDIT] for explanations.

A lookbehind is more complex syntax, but it lets you match only the word you need, without the word "block".

Still, if you don't care about that, or if you are working in JS (which did not support lookbehind at the time of writing), you can do the same with:

/^block (\w+)\n/mg and then read the identifier from capture group 1.
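As a minimal sketch of the capture-group approach, the snippet below runs that pattern over a shortened copy of the template from the question and collects the identifiers. The `extends` pattern is not part of this answer; it is an assumption added here to cover the first regex the question asks for, and the variable names are illustrative only.

```javascript
// Shortened copy of the template from the question (trailing newline included).
const template = [
  'extends standard',
  '',
  'block head',
  '',
  '  <title>title</title>',
  '',
  'block body',
  '',
  '  <h1>Title</h1>',
  '',
  'block scripts',
  '',
  '  <script src="/javascritps/html5shiv.js"></script>',
  ''
].join('\n');

// m: ^ matches at the start of every line; g: keep searching after each match.
const blockName = /^block (\w+)\n/mg;

const names = [];
let match;
while ((match = blockName.exec(template)) !== null) {
  names.push(match[1]); // capture group 1 is the identifier after "block "
}

// Assumed pattern for the "extends" line (not from the answer).
const extendsMatch = /^extends (\w+)$/m.exec(template);
const parent = extendsMatch ? extendsMatch[1] : null;

console.log(names);  // [ 'head', 'body', 'scripts' ]
console.log(parent); // standard
```

Note that the pattern requires a newline right after the identifier, so a final `block name` line with no trailing newline would not match.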

[EDIT] After question changes.

So for JS, with no lookbehind, and grabbing the HTML as well in a single regex, you can use something like this: /block (\w+)\n+([\s\S]*?)(?=\s+\nblock|$)/g.

See it working here: https://regex101.com/r/bFhNSO/2.

Note that I changed the flavor to js in regex101.

[EDIT] add more details.

First, the g flag stands for global, so you can match multiple instances of the same pattern.
(\w+) captures a word. \w is equivalent to [A-Za-z0-9_], so you may want to change it to something more permissive according to your needs.
([\s\S]*?) captures anything, newlines included. It is like the .* that you usually see, but in JS . does not match newlines and there was no s flag to change that, so the longhand equivalent is [\s\S], which matches any whitespace character (\s) or any non-whitespace character (\S). The ? makes the quantifier lazy, meaning it takes the smallest match possible; try the regex without it and you will see the difference.
(?=\s+\nblock|$) is a lookahead, which is allowed in JS. It makes sure the previous match is followed either by whitespace, a newline and the word block, or by the end of the document with $ (with no m flag, $ only matches at the very end of the input).
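Putting those pieces together, here is a rough sketch of how the full pattern could be used to build a map from block name to HTML content. The function name `parseBlocks` and the `.trim()` call are assumptions made for this example: the raw capture keeps the leading indentation, and the last block also keeps the trailing newline.

```javascript
// Template from the question (trailing newline included).
const template = [
  'extends standard',
  '',
  'block head',
  '',
  '  <title>title</title>',
  '  <meta name="description" content="A wonderful thing.">',
  '',
  'block body',
  '',
  '  <h1>Title</h1>',
  '  <p>A wonderful paragraph...</p>',
  '',
  'block scripts',
  '',
  '  <script src="/javascritps/html5shiv.js"></script>',
  ''
].join('\n');

// Group 1: block identifier. Group 2: everything up to the next block or the end.
const blockPattern = /block (\w+)\n+([\s\S]*?)(?=\s+\nblock|$)/g;

// Hypothetical helper: returns { name: html } for each block in the source.
function parseBlocks(source) {
  const blocks = {};
  blockPattern.lastIndex = 0; // reset state, since a g-flagged regex remembers its position
  let match;
  while ((match = blockPattern.exec(source)) !== null) {
    blocks[match[1]] = match[2].trim(); // drop surrounding whitespace only
  }
  return blocks;
}

const blocks = parseBlocks(template);
console.log(Object.keys(blocks)); // [ 'head', 'body', 'scripts' ]
console.log(blocks.body);
```

This still relies on the word block never starting a line inside the HTML, as discussed in the comments.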

That's it, hope it helps people! :)

edited Dec 27 '16 at 02:23

answered Dec 27 '16 at 01:17


Oooo fancy. That's super helpful. How do I get the HTML content in each section? Let me update my question, because both of these are essential : ) – Costa Michailidis Dec 27 '16 at 01:20
Nope, lookbehind is not supported in JS. There is always a workaround, though: do the opposite and use a lookahead! If you are using JS, add the tag to your question! – antoni Dec 27 '16 at 01:26
Cool. I can easily grab the line and substring or split the words. – Costa Michailidis Dec 27 '16 at 01:31
You rock!!! Regex, always looks like magic when it grows past a few notions. Let me see if I can grasp this one.... We've removed the multi line flag because otherwise we couldn't capture more than one line of html. So the regex says `/block ` capture any instances (plural since we have the global flag set) beginning with the word 'block' followed by a space. Then capture `(\w+)` any number of word characters until `\n+` 1 or more new lines. Continue by capturing `([\s\S]*?)`, or \w\W might work too. Then I start to get lost, haha... I'm gonna research more... – Costa Michailidis Dec 27 '16 at 02:03
haha thanks let me edit and add few more details. but u seem quite familiar already – antoni Dec 27 '16 at 02:12
This assumes the HTML won't contain "block". [You can't parse HTML with regex](http://stackoverflow.com/a/1732454/1529630) – Oriol Dec 27 '16 at 02:15
You are right @Oriol, to add more security you can do like `block\s+(\w+)\n+([\s\S]*?)(?=\s+\n+block\s+\w+\n+|$)` but still it is not silverbulletproof haha. the presented regex is only meant for this particular example. – antoni Dec 27 '16 at 02:29
@Oriol, you sure? I think it only assumes the HTML won't contain the word block when it's immediately preceded by a newline. In which case, well, that's actually just fine, because that's invalid HTML anyway. – Costa Michailidis Dec 27 '16 at 02:30
lol, that is why this is more secure `block\s+(\w+)\n+([\s\S]*?)(?=\s+\n+block\s+\w+\n+|$)` https://regex101.com/r/bFhNSO/10 – antoni Dec 27 '16 at 02:40
@Oriol But antoni's regex has [`(?=\s+\nblock|$)`](https://regex101.com/r/bFhNSO/11) and not `(?=\s+block|$)`. If OP's html is always indented and the words like `block` are never, it should always work (: yes you can – bobble bubble Dec 27 '16 at 02:46
@bobblebubble Ah, the regex101 is not updated with that. Yes, if the indentation is respected then it will work, otherwise: https://regex101.com/r/bFhNSO/12 – Oriol Dec 27 '16 at 02:53
