
I want to extract data from HTML. The problem is that I can't extract two strings, one at the top and one at the bottom of my pattern.

I want to extract 23423423423 and 1234523453245, but only if the string Allan appears between them:

                                        <h4><a href="/Profile/23423423423.html">@@@@@@</a>  </h4> said12:49:32
            </div>

                                <a href="javascript:void(0)" onclick="replyAnswer(@@@@@@@@@@,'GET','');" class="reportLink">
                    report                    </a>
                        </div>

        <div class="details">
                            <p class="content">


                       Hi there, Allan.



                                </p>

            <div id="AddAnswer1234523453245"></div>

Of course, I can do something like this: Profile\/(\d+).*\s*.*\s*.*\s*.*\s*.*\s*.*\s*.*\s*.*Allan.*\s*.*\s*.*AddAnswer(\d+). But that pattern is horrible. Is there a way to make it shorter?

I was thinking about:

Profile\/(\d+)(.\sAllan)*AddAnswer(\d+)

or

Profile\/(\d+)(.*Allan\s*)*AddAnswer(\d+)

but neither of them works properly. Do you have any ideas?

audiophonic

3 Answers


You can construct a character class that matches any character, including newlines, by using [\S\s]. Whitespace and non-whitespace characters together cover every character.

With that, your attempts were reasonably close:

/Profile\/(\d+)[\S\s]*Allan[\S\s]*AddAnswer(\d+)/

This looks for Profile, the number that comes after it, any characters up to Allan, any characters up to AddAnswer, and the number that comes after it. If you have single-line mode available (the /s flag), you can use dots instead:

/Profile\/(\d+).*Allan.*AddAnswer(\d+)/s
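
For reference, here is a minimal sketch of the same pattern in Python, assuming the `re` module, where `re.DOTALL` plays the role of the /s flag. The sample markup is condensed from the question and its exact whitespace is only illustrative.

```python
import re

# Sample markup condensed from the question; the layout is illustrative.
html = """
<h4><a href="/Profile/23423423423.html">@@@@@@</a>  </h4> said12:49:32
</div>
<div class="details">
    <p class="content">
        Hi there, Allan.
    </p>
    <div id="AddAnswer1234523453245"></div>
"""

# re.DOTALL is Python's spelling of the /s flag: "." also matches newlines.
pattern = re.compile(r"Profile/(\d+).*Allan.*AddAnswer(\d+)", re.DOTALL)

match = pattern.search(html)
profile_id, answer_id = match.groups() if match else (None, None)
print(profile_id, answer_id)  # 23423423423 1234523453245
```

If the text between the two ids does not contain Allan, `search` returns `None` and both variables stay `None`.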


Strikeskids
  • This won't work for multiple instances (it will only capture the last), see my answer for a better solution. – Jan May 03 '16 at 16:32

You can use m to specify . to match newlines.

/Profile\/(\d+).+AddAnswer(\d+)/m

chifung7
  • Come on, this is utterly wrong - `s` is for single line mode, `m` is for multiline to match `^` and `$`. – Jan May 03 '16 at 14:32

Better to use a parser instead. If you must use regular expressions for whatever reason, you might get by with a tempered greedy token:

Profile/(\d+)            # Profile followed by digits
(?:(?!Allan)[\S\s])+     # any character except when there's Allan ahead
Allan                    # Allan literally
(?:(?!AddAnswer)[\S\s])+ # same construct as above
AddAnswer(\d+)           # AddAnswer, followed by digits

See a demo on regex101.com
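
A sketch of how this pattern might be run in Python, assuming the `re` module with `re.VERBOSE` so the commented layout can be kept as written. The two sample posts and their ids (111, 222, 900, 901) are invented for illustration.

```python
import re

# Two hypothetical posts, both mentioning Allan; the ids are made up.
html = """
<h4><a href="/Profile/111.html">first</a></h4>
<p class="content">Hi there, Allan.</p>
<div id="AddAnswer900"></div>
<h4><a href="/Profile/222.html">second</a></h4>
<p class="content">Thanks, Allan!</p>
<div id="AddAnswer901"></div>
"""

# re.VERBOSE ignores whitespace in the pattern and allows "#" comments.
tempered = re.compile(
    r"""
    Profile/(\d+)             # Profile followed by digits
    (?:(?!Allan)[\S\s])+      # any character, unless Allan starts here
    Allan                     # Allan literally
    (?:(?!AddAnswer)[\S\s])+  # any character, unless AddAnswer starts here
    AddAnswer(\d+)            # AddAnswer followed by digits
    """,
    re.VERBOSE,
)

pairs = tempered.findall(html)
print(pairs)  # [('111', '900'), ('222', '901')]
```

One caveat with this sketch: nothing stops the first tempered group from running past the AddAnswer of a post that does not mention Allan, so such a post could end up paired with a later post's id.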

Jan
  • I believe that regex with non-greedy matches like this might perform better: `/Profile\/(\d+)[\s\S]*?Allan[\s\S]*?(\d+)/g` Regex101 shows that 9110 steps are needed with your match pattern, while only 2740 steps are needed with this non-greedy one. – Petr Srníček May 03 '16 at 16:49
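
A rough Python sketch of the lazy-quantifier idea from the comment above, assuming the `re` module. Unlike the pattern quoted in the comment, it keeps the AddAnswer literal before the second group (my adjustment), and the two sample posts and their ids are invented for illustration. The greedy variant is included only for contrast.

```python
import re

# Two hypothetical posts, both mentioning Allan; the ids are made up.
html = """
<h4><a href="/Profile/111.html">first</a></h4>
<p class="content">Hi there, Allan.</p>
<div id="AddAnswer900"></div>
<h4><a href="/Profile/222.html">second</a></h4>
<p class="content">Thanks, Allan!</p>
<div id="AddAnswer901"></div>
"""

# Lazy quantifiers (*?) stop at the first Allan and the first AddAnswer.
lazy = re.compile(r"Profile/(\d+)[\s\S]*?Allan[\s\S]*?AddAnswer(\d+)")

# Greedy quantifiers (*) run to the last Allan and the last AddAnswer.
greedy = re.compile(r"Profile/(\d+)[\s\S]*Allan[\s\S]*AddAnswer(\d+)")

lazy_pairs = lazy.findall(html)
greedy_pairs = greedy.findall(html)

print(lazy_pairs)    # [('111', '900'), ('222', '901')]
print(greedy_pairs)  # [('111', '901')]
```

On this input the greedy pattern swallows both posts in a single match, which is why it pairs the first profile id with the last answer id.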