I am scraping a site and getting this:

<input type="BUTTON" value="Geographic Footprint" name="GEO_FOOTPRINT" onclick="return OpenModalDialog('https://mspfast.elavon.com/Symphony/client/client.do?uid=0XrHleUX5MudUYVwwsGDYCl&novaid=5418812&readonly=Y&context=BOARDING&defaultRoute=GeographicFootprint')">

What I want is to just grab the uid: 0XrHleUX5MudUYVwwsGDYCl

I am quite new to regex and don't really understand how it works.

I've tried doing:

'/value="Geographic Footprint" name="GEO_FOOTPRINT" onclick="return OpenModalDialog(\'https://mspfast.elavon.com/Symphony/client/client.do?uid=([a-zA-Z0-9]+)\&/'

as the regex, but it does not work: I get the error "Unknown modifier '/'".

Derek
  • "I am quite new to regex and don't really understand how it works" and yet you are trying to use it instead of using an HTML parser? – PeeHaa Dec 02 '15 at 17:56
  • @PeeHaa if someone is not familiar with regex, do you think they would know when to use it or an HTML parser (which they no doubt are not familiar with either)? – Digital Chris Dec 02 '15 at 17:58
  • You forgot to escape the `/` in the url... you should probably learn more about regexes before you try to parse html **AND** javascript with them simultaneously. – Marc B Dec 02 '15 at 17:58
  • It's hard enough to parse tag/attr-val but then to parse the url at the same time might be rough. Are you sure you want to use a regex? –  Dec 02 '15 at 18:15
  • Here not RegEx, but fun: $uid = explode('=', explode('&', explode('?', $str)[1])[0]); – nestedl00p Dec 02 '15 at 18:18
  • What makes this node unique? Which attribute, value? I could help with a DOM example. – Wiktor Stribiżew Dec 02 '15 at 18:21
  • @stribizhev this is the only input with the name geo_footprint – Derek Dec 02 '15 at 18:21
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/ – Alessandro Da Rugna Dec 02 '15 at 18:28
  • Once you select a delimiter (`/` in this case) you can't use that character in your regex unless you escape it. http://php.net/manual/en/regexp.reference.delimiters.php – chris85 Dec 02 '15 at 19:31
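
The delimiter point from the last comment can be sketched as follows. This is a minimal illustration, not the asker's full pattern: the variable names are made up for the example, the attribute value is shortened, and the patterns match only the part of the URL around uid. With `/` as the delimiter, the first unescaped `/` inside the URL ends the pattern, and the character after it is read as a modifier, which is where "Unknown modifier '/'" comes from.

```php
<?php
// Shortened version of the onclick value from the question.
$onclick = "return OpenModalDialog('https://mspfast.elavon.com/Symphony/client/client.do?uid=0XrHleUX5MudUYVwwsGDYCl&novaid=5418812')";

// Option 1: keep "/" as the delimiter and escape every literal "/" in the pattern.
$escaped = '/client\/client\.do\?uid=([a-zA-Z0-9]+)&/';

// Option 2: choose a delimiter that does not occur in the pattern, e.g. "~".
$tilde = '~client/client\.do\?uid=([a-zA-Z0-9]+)&~';

// Option 3: let preg_quote() escape the literal part (it also escapes "." and "?").
$literal = preg_quote('https://mspfast.elavon.com/Symphony/client/client.do?uid=', '~');
$quoted  = '~' . $literal . '([a-zA-Z0-9]+)~';

preg_match($escaped, $onclick, $m1);
preg_match($tilde, $onclick, $m2);
preg_match($quoted, $onclick, $m3);

echo $m1[1], "\n", $m2[1], "\n", $m3[1], "\n";
```

Note that `.`, `?` and `(` are also special characters in a regex, so they need escaping when they are meant literally, as in options 1 and 2.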

2 Answers

Here is a way to access the only element whose name attribute is GEO_FOOTPRINT:

$html = '<body><input type="BUTTON" value="Geographic Footprint" name="GEO_FOOTPRINT" onclick="return OpenModalDialog(\'https://mspfast.elavon.com/Symphony/client/client.do?uid=0XrHleUX5MudUYVwwsGDYCl&novaid=5418812&readonly=Y&context=BOARDING&defaultRoute=GeographicFootprint\')"></body>';
libxml_use_internal_errors(true);
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$xpath = new DOMXPath($dom);
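$link = $xpath->query('//input[@name="GEO_FOOTPRINT"]')->item(0);
// item(0) returns null when no such input exists, so guard before reading the attribute
$val = $link !== null ? $link->getAttribute('onclick') : '';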

Now, once we have the text of the onclick attribute value, we can consider several ways of getting the uid value. Here is a regex one:

preg_match('~[?&]uid=([^&\s]+)~', $val, $m);
echo $m[1];

The regex [?&]uid=([^&\s]+) matches ? or &, then the literal text uid=, and then captures into Group 1 one or more characters other than & or whitespace (\s), so the match stops before the next query parameter.
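
As a sketch of how that pattern might be wrapped for reuse: the helper name `extractUid` is hypothetical and not part of the answer, and the character class is a slight variation that also excludes quotes. With the original class, a uid that happens to be the last parameter would be captured together with the closing `')` of the JavaScript call.

```php
<?php
// Hypothetical helper: returns the uid from an onclick value, or null if absent.
function extractUid(string $onclick): ?string
{
    // Same idea as [?&]uid=([^&\s]+), with ' and " added to the excluded
    // characters so the capture stops at the end of the quoted URL.
    if (preg_match('~[?&]uid=([^&\s\'"]+)~', $onclick, $m) === 1) {
        return $m[1];
    }
    return null; // no uid parameter found
}

$val = "return OpenModalDialog('https://mspfast.elavon.com/Symphony/client/client.do?uid=0XrHleUX5MudUYVwwsGDYCl&novaid=5418812')";
var_dump(extractUid($val));
var_dump(extractUid("return OpenModalDialog('https://mspfast.elavon.com/x?novaid=1')"));
```

Returning null instead of reading `$m[1]` unconditionally avoids an undefined-index warning when the attribute has no uid parameter.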

Other patterns would work too (for example, you may prepend OpenModalDialog\(\'http\S*? to the pattern to restrict the match to that call), as would plain string functions such as explode or substr.
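
One regex-free variant, as a rough sketch: it assumes the URL is the only single-quoted string in the onclick value, which holds for the sample in the question but may not hold on other pages. The variable names are illustrative. It cuts the URL out with strpos/strrpos/substr and lets PHP's built-in parse_url() and parse_str() handle the query string.

```php
<?php
$onclick = "return OpenModalDialog('https://mspfast.elavon.com/Symphony/client/client.do?uid=0XrHleUX5MudUYVwwsGDYCl&novaid=5418812&readonly=Y&context=BOARDING&defaultRoute=GeographicFootprint')";

$params = [];
$start  = strpos($onclick, "'");   // first single quote
$end    = strrpos($onclick, "'");  // last single quote

if ($start !== false && $end !== false && $end > $start) {
    // Everything between the two quotes is the URL.
    $url = substr($onclick, $start + 1, $end - $start - 1);

    // parse_url() isolates the query string; the cast covers a missing query.
    $query = (string) parse_url($url, PHP_URL_QUERY);

    // parse_str() turns "a=1&b=2" into ['a' => '1', 'b' => '2'].
    parse_str($query, $params);
}

echo $params['uid'] ?? 'uid not found', "\n";
```

Because parse_str() returns every parameter, the same array also gives access to novaid, context and the rest without further patterns.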

Wiktor Stribiżew

Here is an example with a named group:

$str = "<input type=\"BUTTON\" value=\"Geographic Footprint\" name=\"GEO_FOOTPRINT\" onclick=\"return OpenModalDialog('https://mspfast.elavon.com/Symphony/client/client.do?uid=0XrHleUX5MudUYVwwsGDYCl&novaid=5418812&readonly=Y&context=BOARDING&defaultRoute=GeographicFootprint')\">";
$regex = '/uid=(?P<uid>[^&]+)/';
// search for uid literally, afterwards match everything except an ampersand 
// and capture it in a group called "uid"

preg_match_all($regex, $str, $matches);
$uid = $matches["uid"][0];
// uid: 0XrHleUX5MudUYVwwsGDYCl

While this might work for this particular example, it's almost always better to use a parser for these tasks (e.g. SimpleXML, although it requires well-formed markup, which the unclosed input tag above is not).

Jan