0

I want to write a regular expression for web scraping.

How can I match a result that spans multiple lines?

For example, this is my HTML:

    <div id="cn-centre-col-inner">

    <p>sothing her</p>
     ...
    </div>

    <div id="ok"> ..</div>

I want a regular expression that gives me this result:

    <div id="cn-centre-col-inner">

    <p>sothing her</p>
     ...
    </div>
Andy Lester
saidmohamed11

2 Answers

2

Regex is not the best tool for this; you should use an HTML parser instead.

Suppose that you have this regex:

(?s)<div id="cn-centre-col-inner">.*?<\/div>

It will capture what you want:

<div id="cn-centre-col-inner">

    <p>sothing her</p>
    ...
</div>
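
As a rough illustration of the regex above, here is a minimal Python sketch. The choice of Python and the variable names are assumptions for the example, not part of the question; `re.DOTALL` is the flag equivalent of the inline `(?s)`.

```python
import re

# Sample HTML from the question; the trailing "ok" div must not be captured.
html = '''<div id="cn-centre-col-inner">

    <p>sothing her</p>
    ...
</div>

<div id="ok"> ..</div>'''

# re.DOTALL lets "." match newlines, so ".*?" can span several lines.
# The lazy quantifier stops at the first closing </div> it finds.
pattern = re.compile(r'<div id="cn-centre-col-inner">.*?</div>', re.DOTALL)

match = pattern.search(html)
block = match.group(0) if match else None
print(block)
```

Without the dot-matches-newline mode, `.` stops at the first line break and the pattern finds nothing in this input.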

However, you can't guarantee that the first closing div is the right one. For instance, in this case:

<div id="cn-centre-col-inner">

    <p>sothing her</p>
    ...
    <div>something inner 1</div>
    <div>something inner 2</div>
</div>
<div id="ok"> ..</div>

You will lose content and capture only:

<div id="cn-centre-col-inner">

    <p>sothing her</p>
    ...
    <div>something inner 1</div>
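
The truncation can be reproduced with a short Python sketch (Python and the variable names are assumptions made for the example). The lazy `.*?` ends at the first `</div>`, which here belongs to a nested div rather than to the outer one.

```python
import re

# Same target div as before, but now it contains nested divs.
html = '''<div id="cn-centre-col-inner">

    <p>sothing her</p>
    ...
    <div>something inner 1</div>
    <div>something inner 2</div>
</div>
<div id="ok"> ..</div>'''

match = re.search(r'(?s)<div id="cn-centre-col-inner">.*?</div>', html)
captured = match.group(0) if match else None

# The match ends at the first inner </div>; the second inner div
# and the real closing tag are silently dropped.
print(captured)
```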


This is a good example of why regex shouldn't be used to parse complex HTML. I strongly recommend using an HTML parser.
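
For comparison, here is one way a parser-based approach might look, sketched with Python's standard-library `html.parser`. The class name `DivExtractor` and its attributes are hypothetical, invented for this example. It tracks the nesting depth of `<div>` tags so that the matching closing tag is found even with nested divs. It is only a sketch: it rebuilds the markup from parser events, so comments, entities and self-closing tags are not reproduced exactly.

```python
from html.parser import HTMLParser


class DivExtractor(HTMLParser):
    """Rebuilds the markup of the <div> with a given id, nested divs included."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # number of currently open <div>s, counting the target
        self.parts = []     # pieces of markup collected for the target div
        self.done = False   # set once the target div has been closed

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth == 0:
            # Not inside the target yet: only react to the target div itself.
            if tag == "div" and dict(attrs).get("id") == self.target_id:
                self.depth = 1
                self.parts.append(self.get_starttag_text())
            return
        if tag == "div":
            self.depth += 1
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.done or self.depth == 0:
            return
        self.parts.append(f"</{tag}>")
        if tag == "div":
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0 and not self.done:
            self.parts.append(data)


html = '''<div id="cn-centre-col-inner">

    <p>sothing her</p>
    ...
    <div>something inner 1</div>
    <div>something inner 2</div>
</div>
<div id="ok"> ..</div>'''

parser = DivExtractor("cn-centre-col-inner")
parser.feed(html)
parser.close()
extracted = "".join(parser.parts)
print(extracted)
```

Third-party parsers offer a shorter route to the same result, but the depth-counting idea is the same: the closing tag is matched by structure, not by position in the text.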

If you are absolutely sure that your div cn-centre-col-inner contains no nested divs, you can go ahead with the regex above. You can also use a capturing group to get only the content within the div:

(?s)<div id="cn-centre-col-inner">(.*?)<\/div>
                                  ^---^--- notice the parentheses
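
A minimal Python sketch of the capturing-group variant, assuming the div has no nested divs (Python and the variable names are assumptions for the example). `group(1)` holds only what the parentheses matched.

```python
import re

html = '''<div id="cn-centre-col-inner">

    <p>sothing her</p>
    ...
</div>

<div id="ok"> ..</div>'''

pattern = re.compile(r'(?s)<div id="cn-centre-col-inner">(.*?)</div>')
match = pattern.search(html)

# group(0) would be the whole div; group(1) is just its inner content.
inner = match.group(1).strip() if match else None
print(inner)
```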


Federico Piazza
1

After reading the warnings about regexes and HTML, and if it is just for a specific task, you can try something dirty like this (it relies on the dot matching newlines, for example via the (?s) flag):

(<div[^>]*id="cn-centre-col-inner.*</div>)\n<div id="ok"
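
A possible Python rendering of this idea, as a sketch only. Two details are adaptations for the example, not part of the pattern above: `re.DOTALL` is passed so that `.` crosses line breaks, and `\s*` replaces `\n` so that blank lines between the two divs are tolerated. Because `.*` is greedy here, the match backtracks to the last `</div>` that is followed by `<div id="ok"`, so nested divs are kept.

```python
import re

html = '''<div id="cn-centre-col-inner">

    <p>sothing her</p>
    ...
    <div>something inner 1</div>
    <div>something inner 2</div>
</div>
<div id="ok"> ..</div>'''

# Anchor the end of the match on the start of the next known sibling div.
pattern = re.compile(
    r'(<div[^>]*id="cn-centre-col-inner.*</div>)\s*<div id="ok"',
    re.DOTALL,
)

match = pattern.search(html)
outer = match.group(1) if match else None
print(outer)
```

This only works while the page keeps `<div id="ok"` directly after the target div, which is why it suits a one-off task and not a general scraper.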
Gaël Barbin