Parse HTML using perl regex

Question

I created a Perl script that uses an online service to crack MD5 hashes that the user inputs. I am partially successful: I get the response from the website, but I still need to parse the HTML and display each hash and its corresponding clear-text password to the user. The following is a snippet of the output I get now:

<strong>21232f297a57a5a743894a0e4a801fc3</strong>: admin</p>

Using RegexBuddy, I was able to match the hash part alone with the expression [a-z0-9]{32}. I need the final output in the following format:

21232f297a57a5a743894a0e4a801fc3: admin

Any help would be appreciated. Thank you!

Take a look at http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags. — simbabque, Feb 20 '14 at 13:23
I don't understand the scenario: if it is your website, and the user is already using form elements, why can't you just use the POST / GET parameter? — cypherabe, Feb 20 '14 at 13:28
Thanks Mpapec. That worked! Cypherabe: It is not my website. I am just using an online hash cracking service. The hash is however sent via the tool and response is parsed. — bAd bOy, Feb 21 '14 at 02:16

brian d foy · Answer 1 · 2023-06-23T05:52:19.010

Ten years later, we have more sophisticated and easier solutions.

If you can write a CSS selector, you can easily pull out the parts of a response without dealing with the complexity of HTML::Parser and other approaches. Mojo::UserAgent does it all for you:

use v5.16;

# This is a real URL just for this answer
my $url = 'https://gist.githubusercontent.com/briandfoy/85033496f93e860cdf53f45ba931e8f7/raw/a0876e090fc5a3c1b75ff2a19580f731b784828d/selector_example.html';

use Mojo::UserAgent;
my $ua = Mojo::UserAgent->new;

my $tx = $ua->get($url);
my $hash = $tx->res->dom->at( 'div p strong' )->text;  # or whatever selector

say $hash;

The at method finds only the first node that matches the selector. If there are multiple matching nodes, use find instead. It returns the matches as a "collection" (a Mojo::Collection), which you can transform with map:

my $hashes = $tx->res->dom->find( 'div p strong' )
    ->map( sub {$_->text} )
    ->join("\n");

The particular selector, div p strong, depends on the HTML you get back. It is much easier to pinpoint an element if it has id or class values. I go through these extensively in Mojo Web Useragents.
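To illustrate the point about id and class values, here is a minimal sketch that parses an inline HTML string with Mojo::DOM instead of fetching a page. The markup, the id "results", and the class "cracked" are invented for this example; the real service's HTML will differ.

```perl
use v5.16;
use Mojo::DOM;

# Hypothetical markup: the id and class names are assumptions for illustration
my $html = <<'HTML';
<div id="results">
  <p class="cracked"><strong>21232f297a57a5a743894a0e4a801fc3</strong>: admin</p>
  <p class="cracked"><strong>5f4dcc3b5aa765d61d8327deb882cf99</strong>: password</p>
</div>
HTML

my $dom = Mojo::DOM->new($html);

# all_text collects the text of the element and its descendants,
# so each <p> yields the hash and the password together
my @lines = $dom->find('#results p.cracked')->map('all_text')->each;

say for @lines;
```

Selecting the whole p element rather than strong keeps the hash and the clear-text password together, which matches the output format asked for in the question.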

score 2 · Accepted Answer · answered Feb 20 '14 at 13:21

I think you'd be much better off using HTML::Parser to parse that HTML simply and reliably. Otherwise you're into the nightmare of parsing HTML with regexps, and you'll find that it doesn't work reliably.
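As a rough sketch of what that could look like with HTML::Parser's event-driven (version 3) API: the handlers below collect the text of each p element. The inline HTML is an assumption based on the snippet in the question, with the opening p tag added; the variable names are made up for this example.

```perl
use strict;
use warnings;
use HTML::Parser;

# Assumed input: the question's snippet with its opening <p> restored
my $html = '<p><strong>21232f297a57a5a743894a0e4a801fc3</strong>: admin</p>';

my @lines;       # accumulated text, one entry per <p> element
my $in_p = 0;    # true while the parser is inside a <p>

my $parser = HTML::Parser->new(
    api_version => 3,
    # Start a new entry whenever a <p> opens
    start_h => [ sub { my ($tag) = @_; if ( $tag eq 'p' ) { $in_p = 1; push @lines, '' } }, 'tagname' ],
    # Text may arrive in several chunks, so append rather than assign
    text_h  => [ sub { my ($text) = @_; $lines[-1] .= $text if $in_p }, 'dtext' ],
    end_h   => [ sub { my ($tag) = @_; $in_p = 0 if $tag eq 'p' }, 'tagname' ],
);

$parser->parse($html);
$parser->eof;

print "$_\n" for @lines;
```

The strong tags produce start and end events but no text of their own, so the collected string is simply the hash followed by ": admin".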

Brian Agnew

1. Find a *regex HTML* question. 2. Post a comment answer "use a parser". 3. ???? 4. Profit!! ;p – Qtax Feb 20 '14 at 13:22
I'd like to think actually pointing the OP to a specific Perl HTML parsing module is a little more than that – Brian Agnew Feb 20 '14 at 13:23
I prefer pointing them to [THE parsing HTML question](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags)... :) – simbabque Feb 20 '14 at 13:24
Thanks Brian. I am pretty new to Perl. I would require time to figure out using the different modules available. For now, I got this working using the little regex tip from Mpapec. – bAd bOy Feb 21 '14 at 02:21
@bAd bOy - I'd perhaps rephrase that as an answer and accept it (or get Mpapec to do it!) – Brian Agnew Feb 21 '14 at 10:03

score 1 · Answer 3 · answered Feb 20 '14 at 13:27

There are a few tools available on CPAN that can handle both fetching and parsing the page for you. One of them is Web::Scraper. Tell it which page to fetch and which nodes (in XPath or CSS syntax) you want, and it will get them for you. I won't give an example as I don't know your URL.
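For reference, a minimal Web::Scraper sketch that works on an inline HTML string rather than a live URL. The markup is assumed from the snippet in the question (with the opening p tag added), and the key name "lines" is arbitrary.

```perl
use strict;
use warnings;
use Web::Scraper;

# Assumed input: the question's snippet with its opening <p> restored
my $html = '<p><strong>21232f297a57a5a743894a0e4a801fc3</strong>: admin</p>';

# Collect the text of every <p> element into an array under the key "lines"
my $scraper = scraper {
    process 'p', 'lines[]' => 'TEXT';
};

# scrape also accepts a URI object, in which case it fetches the page first
my $result = $scraper->scrape($html);

print "$_\n" for @{ $result->{lines} };
```

With a real page you would pass URI->new($url) to scrape and adjust the selector to whatever the service actually returns.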

There is a good blogpost about this on blogs.perl.org by stas that uses a different module that might also be helpful.

score 0 · Answer 4 · answered Feb 20 '14 at 15:54

Here it is:

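use strict; use warnings;
# Capture the hash and the text after it; [^<]* cannot run past the next tag
my $str = q{<strong>21232f297a57a5a743894a0e4a801fc3</strong>: admin</p>};
my ($hash, $rest) = $str =~ m{<strong>([0-9a-f]{32})</strong>([^<]*)</p>};
print "$hash$rest\n";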
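If the response contains several results, a single match is not enough. The following sketch, which assumes every result has exactly the shape shown in the question, loops with the /g modifier and collects each pair. The second hash and the surrounding p tags are illustrative additions.

```perl
use strict;
use warnings;

# Assumed response body: two results in the format from the question
my $html = <<'HTML';
<p><strong>21232f297a57a5a743894a0e4a801fc3</strong>: admin</p>
<p><strong>5f4dcc3b5aa765d61d8327deb882cf99</strong>: password</p>
HTML

my @results;

# In scalar context, /g resumes after the previous match on each iteration
while ( $html =~ m{<strong>([0-9a-f]{32})</strong>:\s*([^<]*)</p>}g ) {
    push @results, "$1: $2";
}

print "$_\n" for @results;
```

This stays workable only while the markup is this regular; any change in tags, attributes, or nesting on the service's side will silently stop it from matching.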

ulan

score -1 · Answer 5 · edited Jun 26 '23 at 15:59

So, doing this reliably in Perl is possible, because Perl's regexps have extended features with the complexity required to parse HTML (you can embed Perl code in your regex and your regex within Perl code but, more to the point here, you can write recursive regular expressions that can actually parse HTML). However, you really don't want to do this. If you decide to do so anyway, please read up on advanced Perl regular expressions so you have some idea of what you are getting into.

The actual problem here is that parsing something like HTML reliably is an immense task, and it is extremely tempting to use naive regex implementations to do it. Those will break because regular expressions (outside of Perl's extensions) can only describe regular languages, and HTML, with its arbitrarily nested elements, is not a regular language. Perl is different: additions such as recursive regular expressions make this possible, but very difficult, bug-prone, and complex. Moreover, you would realistically only be able to support one variant of HTML.
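To make the idea of a recursive regular expression concrete, here is a small sketch that matches a div element containing arbitrarily nested div elements. It is an illustration of the recursion feature only: it assumes bare tags with no attributes, comments, or other elements, and is nowhere near a real HTML parser.

```perl
use strict;
use warnings;

# (?&div) recurses into the named group "div", so nesting of any depth matches
my $nested_div = qr{
    (?<div>
        <div>
        (?: [^<]++ | (?&div) )*    # plain text, or a complete nested <div>
        </div>
    )
}x;

my $html    = 'x<div>a<div>b</div>c</div>y';
my $matched = $html =~ $nested_div ? $+{div} : undef;

print "$matched\n" if defined $matched;
```

A pattern without recursion cannot pair each closing tag with its own opening tag at arbitrary depth. Even so, this sketch breaks as soon as a tag carries an attribute, which is the kind of fragility described above.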

HTML::Parser does the main work in C, as an XS module, which can take advantage of pre-existing C libraries. Importantly, this means you do not have to maintain a massive, nightmarish codebase yourself.
