I am trying to match multi-line HTML source code with a regular expression (using AutoIt). HTML source code to match:

<li class="mission">
    <div>
        <div class="missionTitle">
            <h3>Eat a quarter-pounder with cheese</h3>
            <div class="missionProgress">
                <span>100%</span>
                <div class="missionProgressBar" style="width: 100%;"></div>
            </div>
        </div>
        <div class="missionDetails">
            <ul class="missionRewards">
                <li class="rewardCash">5,000&ndash;8,000</li>
                <li class="rewardXP">XP +5</li>
                                </ul>
                            <div class="fightItems clearfix">
                <h5><span>Prerequisites:</span></h5>
                                    <div class="fightItemsWrap">
                                            <div class="fightItem tooltip" title="Sunglasses" data-attack="Attack: 2" data-defence="Defence: 2">
                        <img src="/img/enhancement/3.jpg" alt="">
                        <span>&times; 1</span>
                    </div>
                                            <div class="fightItem tooltip" title="Broad Shoulders" data-attack="Attack: 0" data-defence="Defence: 3">
                        <img src="/img/enhancement/1003.jpg" alt="">
                        <span>&times; 1</span>
                    </div>
                                            <div class="fightItem tooltip" title="Irish Fond Anglia" data-attack="Attack: 4" data-defence="Defence: 8">
                        <img src="/img/enhancement/2004.jpg" alt="">
                        <span>&times; 1</span>
                    </div>
                                        </div>
            </div>
                            <form action="/quest/index/i/kdKJBrgjdGWKqtfDrHEkRM2duXVn1ntH/h/c0b2d58642cd862bfad47abf7110042e/t/1336917311" method="post">
                <input type="hidden" id="id" name="id" value="17"/>
                <button class="button buttonIcon btnEnergy"><em>5</em></button>
            </form>
        </div>
    </div>
</li>

It is present multiple times on a single page (but items within <div class="fightItems clearfix">...</div> vary).

  • I need to match
    • <h3>Eat a quarter-pounder with cheese</h3>,
    • the first span <span>100%</span> and
    • <input type="hidden" id="id" name="id" value="17"/>.

Expected result (for every occurrence on a page):

$a[0] = "Eat a quarter-pounder with cheese"
$a[1] = "100%"
$a[2] = "17"

What I came up with:

(?U)(?:<div class="missionTitle">\s+<h3>(.*)</h3>\s+<div class="missionProgress">\s+<span>(.*)</span>)|(?:<form .*\s+.*<input\stype="hidden"\sid="id"\sname="id"\svalue="(\d+)"/>\s+.*\s+</form>)

But that leaves some array items empty. I also tried the (?s) flag, but then it only captures the first occurrence (and stops matching after that).

user4157124
t0r
  • I'm not familiar with AutoIt, but have a look whether it has some HTML/XML support. You'll be much better off using a proper parser. – carlpett May 13 '12 at 20:20
  • Next time, I wouldn't use a regular expression for this purpose, but rather a for loop that iterates over each line of the string and takes action based on whether a certain substring is present. It's a bit more flexible. – Jos van Egmond May 14 '12 at 09:03

2 Answers

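I should not have used . to match words or integers: with the (?s) flag set, the dot also matches newlines, so it can run across several lines. Character classes are safer for the captured values. The corrected regex is:

(?U)(?s)<div class="missionTitle">\s+<h3>([^<]+)</h3>(?:.*)<div class="missionProgress">\s+<span>(\d+%)</span>(?:.*)<input.* value="(\d+)"/>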
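
To illustrate why the dot misbehaves once it can cross line breaks, here is a minimal sketch in Python rather than AutoIt. It rests on several assumptions: Python's re module has no (?U) flag, so laziness is written per quantifier as .*?; the sample HTML is a trimmed, hypothetical two-mission page (the second mission, its 40% and its id 18 are invented for illustration); and the sub-pattern names head, progress and tail are mine.

```python
import re

# Two simplified mission blocks (hypothetical sample, trimmed from the question's HTML).
html = """
<div class="missionTitle">
    <h3>Eat a quarter-pounder with cheese</h3>
    <div class="missionProgress">
        <span>100%</span>
    </div>
</div>
<input type="hidden" id="id" name="id" value="17"/>
<div class="missionTitle">
    <h3>Drink a milkshake</h3>
    <div class="missionProgress">
        <span>40%</span>
    </div>
</div>
<input type="hidden" id="id" name="id" value="18"/>
"""

# Captured values use character classes, never the dot.
head = r'<div class="missionTitle">\s+<h3>([^<]+)</h3>'
progress = r'<div class="missionProgress">\s+<span>(\d+%)</span>'
tail = r'<input[^>]* value="(\d+)"/>'

# Greedy dot with DOTALL: each ".*" runs to the end of the input and backtracks,
# so a single match spans both mission blocks.
greedy = re.findall(head + r'.*' + progress + r'.*' + tail, html, re.DOTALL)

# Lazy dot with DOTALL: each ".*?" stops at the nearest following anchor,
# giving one match per mission block.
lazy = re.findall(head + r'.*?' + progress + r'.*?' + tail, html, re.DOTALL)

print(len(greedy), greedy)
print(len(lazy), lazy)
```

The greedy variant reports one match that mixes the first title with the last progress value and id, which resembles the "only the first occurrence" symptom from the question; the lazy variant reports one tuple per mission.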
user4157124
Matt

Regular expression to match multi-line HTML source code:

  • As per the documentation:

    • \R matches newline sequences (?>\r\n|\n|\r),
    • the dot . does not match newlines (unless (?s) is set),
    • \s matches whitespace characters (including newlines).
  • Generally, some combination is required to cross a line break (like \R\s*?).

  • Non-capturing groups are redundant here (match without capturing instead).
  • If the text is enclosed by a unique delimiter, exclude that delimiter with a negated character class instead (like attribute="([^"]*?)" for text between double quotes).
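
The points above can be checked with a short sketch, written in Python rather than AutoIt. Two assumptions to note: Python's re module does not support \R, so the newline alternation is spelled out in a helper constant named NEWLINE (my name), and the sample string is a hypothetical two-line fragment.

```python
import re

# Hypothetical fragment with a Windows-style line break between the two tags.
text = '<h3>Title</h3>\r\n    <span class="a b">100%</span>'

# Without (?s), the dot cannot cross the line feed, so this finds nothing.
no_flag = re.search(r'</h3>.*<span', text)

# With (?s), the dot matches newlines too, so the same idea succeeds.
with_flag = re.search(r'(?s)</h3>.*?<span', text)

# \s matches line breaks as well as spaces and tabs.
whitespace = re.search(r'</h3>\s+<span', text)

# Stand-in for \R (not available in Python's re): one newline sequence,
# followed by optional indentation.
NEWLINE = r'(?:\r\n|\n|\r)'
gap = re.search(r'</h3>' + NEWLINE + r'\s*?<span', text)

# Negated character class: everything between the double quotes.
attr = re.search(r'class="([^"]*?)"', text).group(1)

print(no_flag is None, with_flag is not None, whitespace is not None)
print(gap is not None, attr)
```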

Example (the pattern contains double quotes; handle them as per Documentation - FAQ - double quotes):

(?s)<div class="missionTitle">.*?<h3>(.*?)</h3>.*?<div class="missionProgress">.*?<span>([^<]*?)</span>.*?<input type="hidden" id="id" name="id" value="([^"]*?)"/>
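
As a rough check of the pattern above, the following Python sketch applies it to a trimmed page with two missions. The sample HTML is hypothetical (the second mission, its 25% and its id 18 are invented), and PATTERN, sample and missions are illustrative names. Python is used only because its re module accepts this pattern unchanged.

```python
import re

# The pattern from the example, split across lines for readability only.
PATTERN = re.compile(
    r'(?s)<div class="missionTitle">.*?<h3>(.*?)</h3>'
    r'.*?<div class="missionProgress">.*?<span>([^<]*?)</span>'
    r'.*?<input type="hidden" id="id" name="id" value="([^"]*?)"/>'
)

# Hypothetical page: the first mission has an extra <span> among its fight items,
# the second mission has no fight items at all.
sample = '''
<li class="mission">
    <div class="missionTitle">
        <h3>Eat a quarter-pounder with cheese</h3>
        <div class="missionProgress">
            <span>100%</span>
        </div>
    </div>
    <div class="fightItem tooltip" title="Sunglasses">
        <span>&times; 1</span>
    </div>
    <form method="post">
        <input type="hidden" id="id" name="id" value="17"/>
    </form>
</li>
<li class="mission">
    <div class="missionTitle">
        <h3>Second mission</h3>
        <div class="missionProgress">
            <span>25%</span>
        </div>
    </div>
    <form method="post">
        <input type="hidden" id="id" name="id" value="18"/>
    </form>
</li>
'''

# findall returns one (title, progress, id) tuple per mission.
missions = PATTERN.findall(sample)
for title, progress, mission_id in missions:
    print(title, progress, mission_id)
```

Only the first span after missionProgress is captured, so the fight-item spans are skipped. In AutoIt, a comparable global match is available through StringRegExp with the $STR_REGEXPARRAYGLOBALMATCH flag, which returns the captures as one flat array rather than as tuples.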


Whether regular expressions should be used on HTML (beyond simple listings like this) is a different question (been, done, T-shirt).

user4157124