Bash - Regex for HTML contents

Question

I'm learning about Bash scripting, and need some help understanding regex's.

I have a variable that is basically the html of a webpage (exported using wget):

currentURL="https://www.example.com"
currentPage=$(wget -q -O - "$currentURL")

I want to get the id's of all linked photos in this page. I just need help figuring out what the RegEx should be.

I started with this, but I need to modify the regex:

Test string (this is what currentPage contains, there can be zero to many instances of this):

<a href="./download/file.php?id=123456&mode=view"><img src="./download/file.php?id=123456&t=1"></a>

Current Regex:

.\/download\/file.php\?id=[0-9]{6}\&mode=view

Here's the regex I created, but it doesn't seem to work in bash.

The best solution would be to have the ID of each file. In this case, simply 123456. But if we can start with getting the /download/file.php?id=123456, that'd be a good start.

regex is not the right tool for such cases. xml/html parsers should be used. Post the actual testable url — RomanPerekhrest, Mar 18 '18 at 20:01

Gilles Quénot · Accepted Answer · 2023-04-30T19:07:46.657

Don't parse XML/HTML with regex, use a proper XML/HTML parser.

theory :

According to compiler theory, HTML can't be parsed with regular expressions, because they are based on finite state machines. Since HTML is built hierarchically, you need a pushdown automaton, for example an LALR grammar processed by a tool like YACC.
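
A small illustration of the nesting problem, as a sketch rather than a proof. The sample markup is made up, and `grep -P` is assumed to be GNU grep built with PCRE support. A lazy pattern stops at the first closing tag it meets, so it cannot pair each opening `<div>` with its own closing tag:

```shell
# Hypothetical sample: one <div> nested inside another.
html='<div><div>inner</div>outer</div>'

# The lazy match ends at the FIRST </div>, not at the one that
# actually closes the outer <div>.
match=$(printf '%s\n' "$html" | grep -oP '<div>.*?</div>')
printf '%s\n' "$match"
```

The reported match is cut off right after `inner</div>`, so the rest of the outer element is lost. A greedy `.*` fails the other way round as soon as two sibling elements share a line.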

realLife©®™ everyday tool in a shell :

You can use one of the following :

xmllint often installed by default with libxml2, xpath1

xmlstarlet can edit, select, transform... Not installed by default, xpath1

xpath installed via perl's module XML::XPath, xpath1

xidel xpath3

saxon-lint my own project, wrapper over @Michael Kay's Saxon-HE Java library, xpath3

Or you can use a high-level language with a proper library, for example:

python's lxml (from lxml import etree)

perl's XML::LibXML, XML::XPath, XML::Twig::XPath, HTML::TreeBuilder::XPath

php's DOMXpath

Check: Using regular expressions with HTML tags

Example using xidel:

xidel -s "$currentURL" -e '//a/extract(@href,"id=(\d+)",1)'

edited Apr 30 '23 at 19:07

answered Mar 18 '18 at 20:10

Gilles Quénot


Thank you for explaining why a RegEx is not suitable for parsing HTML. I'll try this out. – Mr. C Mar 18 '18 at 21:02
I tried this out but could not get it to work. How do I feed the contents of `currentURL` into this? – Mr. C Mar 18 '18 at 23:31
Xidel can do this all by itself without the need of grep or sed: `./xidel -s "$currentURL" -e '//a/extract(@href,"id=(\d+)",1)'` – Reino Mar 19 '18 at 15:18
Thanks @Reino, added it to my post. I'm only starting to learn xidel, even though I've known about it for a couple of years – Gilles Quénot Mar 19 '18 at 15:27
@Reino: do you have doc on this syntax? `'//a/extract(@href,"id=(\d+)",1)'` – Gilles Quénot Feb 18 '23 at 21:15
For [`extract()`](https://www.benibela.de/documentation/internettools/xpath-functions.html#x-extract) you mean? – Reino Feb 19 '23 at 14:26
Thanks. And the use of `1` ? This is more like `grep -oP`, you should correct your doc. Very nice trick – Gilles Quénot Feb 19 '23 at 14:31
OK: `If the $match argument is provided, only the $match-th submatch will be returned.` – Gilles Quénot Feb 19 '23 at 14:42
I'm not `xidel`'s author, @BeniBela is. If you're correct, then _he_ should fix that. Btw, if you want to learn more about all the things you can do with `xidel`, then you might find [my hobby-project](https://github.com/Reino17/xivid/) interesting (the XQuery Module _'xivid.xqm'_ in particular). Beware that the documentation is in Dutch. – Reino Feb 19 '23 at 15:11
https://github.com/benibela/xidel/issues/101 – Gilles Quénot Feb 19 '23 at 15:32

janos · Answer 2 · 2018-03-18T20:26:27.280

Let's first clarify a couple of misunderstandings.

I'm learning about Bash scripting, and need some help understanding regex's.

You seem to be implying some sort of relation between Bash and regex. As if Bash was some sort of regex engine. It isn't. The [[ builtin is the only thing I recall in Bash that supports regular expressions, but I think you mean something else.
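
For reference, a minimal sketch of what the `[[` regex support looks like in practice. It assumes Bash 3.2 or later; the variable names `html`, `re`, `rest` and `ids` are only illustrative, and the sample string is the one from the question. `[[ string =~ regex ]]` uses POSIX ERE and stores the captures in the `BASH_REMATCH` array:

```shell
# Sample anchor taken from the question.
html='<a href="./download/file.php?id=123456&mode=view"><img src="./download/file.php?id=123456&t=1"></a>'

# ERE pattern kept in a variable so the quoting stays simple.
# The parentheses capture the digits of the id.
re='\./download/file\.php\?id=([0-9]+)&mode=view'

ids=()
rest=$html
# [[ =~ ]] only reports the first match, so strip everything up to and
# including that match, then try again on the remainder.
while [[ $rest =~ $re ]]; do
  ids+=("${BASH_REMATCH[1]}")
  rest=${rest#*"${BASH_REMATCH[0]}"}
done

printf '%s\n' "${ids[@]}"
```

The loop is needed because Bash has no "find all matches" mode, which is one reason `grep -o` tends to be more convenient for this kind of extraction.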

There are some common commands executed in Bash that support some implementation of regular expressions such as grep or sed and others. Maybe that's what you meant. It's good to be specific and accurate.

I want to get the id's of all linked photos in this page. I just need help figuring out what the RegEx should be.

This suggests an underlying assumption that if you want to extract content from HTML, then regex is the way to go. That assumption is incorrect.

It's best to extract content from HTML with an XML/HTML parser (using one of the suggestions in Gilles' answer), and reaching for regex is not a good reflex. Still, for simple cases like yours a regex might just be good enough:

grep -oP '\./download/file\.php\?id=\K\d+(?=&mode=view)' file.html

Take note that you escaped the wrong characters in the regex:

/ and & don't have a special meaning and don't need to be escaped
. and ? have special meaning and need to be escaped
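
A quick way to see why the escaping matters, as a sketch with a made-up input line (the file name `fileXphp` is hypothetical and chosen so that it should not match):

```shell
# Made-up input: "fileXphp" is NOT the file we are looking for.
line='./download/fileXphp?id=123456'

# Unescaped dot: "." matches any character, so the wrong name still matches.
loose=$(printf '%s\n' "$line" | grep -cE 'file.php')

# Escaped dot: only a literal "." is accepted, so nothing matches.
# grep exits non-zero when it finds nothing; "|| true" keeps that from
# aborting a script that runs under "set -e".
strict=$(printf '%s\n' "$line" | grep -cE 'file\.php' || true)

echo "unescaped: $loose, escaped: $strict"
```

`grep -c` prints the number of matching lines, so the unescaped pattern reports one match where the escaped pattern reports none.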

Some extra tricks in the above regex are good to explain:

The -P flag of grep enables Perl style (powerful) regular expressions
\K is a Perl-specific escape: it drops everything matched before the \K from the reported match
The (?=...) is a zero-width positive lookahead assertion. For example, /\w+(?=\t)/ matches a word followed by a tab, without including the tab in the match.
The \K and the lookahead are there to work with grep -o, which outputs only the matched part. Without them, the matched part would be, for example, ./download/file.php?id=123456&mode=view, which is more than what you want.
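
To run the same pattern against the variable from the question instead of a file, something along these lines should work. This is a sketch that assumes GNU grep with PCRE support; the literal `currentPage` value stands in for the real download, and the second id (654321) is invented for the example:

```shell
# Stand-in for the downloaded page; in the question this would come from
# currentPage=$(wget -q -O - "$currentURL").
currentPage='<a href="./download/file.php?id=123456&mode=view"><img src="./download/file.php?id=123456&t=1"></a>
<a href="./download/file.php?id=654321&mode=view"><img src="./download/file.php?id=654321&t=1"></a>'

# Pipe the variable into grep; -o prints one id per line.
ids=$(printf '%s\n' "$currentPage" | grep -oP '\./download/file\.php\?id=\K\d+(?=&mode=view)')

# The ids are plain digits, so unquoted word splitting is safe here.
for id in $ids; do
  echo "photo id: $id"
done
```

Because the lookahead requires `&mode=view`, the `<img src=...&t=1>` URLs are skipped and each photo id is reported once.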

*As if Bash was some sort of regex engine. It isn't.* While I agree with what you are saying, to be accurate, Bash does implement ERE regex substantially similar to POSIX grep. ERE regex is still not suitable for parsing HTML. — dawg, Mar 18 '18 at 23:22
