I have a regex that parses folder and file names out of a block of HTML code, and I want it to exclude filenames with the extension .ini.
My current regex: /href="([\w]+)(\.[\w]+)*/ig
- Group one: 1+ word characters
- Group two, repeated 0+ times: a literal . followed by 1+ word characters
- Flags: case-insensitive (i) and global (g), so it finds as many matches as possible
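To make the behaviour described above concrete, here is a small sketch of what the current pattern captures. The `sample` string is a shortened, hypothetical excerpt of the directory listing, and the variable names are only for illustration.

```javascript
// Current pattern from the question, unchanged.
const current = /href="([\w]+)(\.[\w]+)*/ig;

// Shortened, hypothetical excerpt of the directory listing.
const sample =
  '<a href="20190823/">20190823/</a> <a href="desktop.ini">desktop.ini</a>';

// matchAll needs the g flag and yields one match object per matching href.
const currentMatches = [...sample.matchAll(current)].map(m => ({
  name: m[1],      // group one: the leading word characters
  extension: m[2], // group two: last ".xyz" part, or undefined if absent
}));

console.log(currentMatches);
```

The folder link yields group one only, while the desktop.ini link yields both groups, which is exactly the match I want to get rid of.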
I have tried to use negative lookahead (what I think is the proper solution) time and time again to remove a match if it has the extension .ini. Sadly, I have failed my mission, and here I am. I chose not to include my attempts above because they would just pollute the question.
What I have read so far:
- Negative Lookahead
- Match strings not containing a string: https://www.regextester.com/15
- Regular expression for excluding file types .exe and .js
To restate:
- What I have is two groups.
- What I think I should do is use negative lookahead to match .ini, and then, if it matches, exclude all groups from that match.
I could figure out how to ignore just the .ini group, but could not figure out how to get the regex to ignore all groups. Can you please help me figure out the proper regex?
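For illustration, one possible shape for such a pattern is sketched below. It is an assumption rather than a verified solution: the lookahead is placed directly after href=", before either group, so a value ending in .ini is rejected as a whole instead of letting the engine backtrack into a shorter match such as "deskto". The names `withoutIni` and `listing` are hypothetical, and `listing` is a shortened excerpt of the sample input.

```javascript
// Hypothetical variant of the question's pattern: the negative lookahead
// rejects any href value that ends in .ini right before the closing quote.
const withoutIni = /href="(?![^"]*\.ini")([\w]+)(\.[\w]+)*/ig;

// Shortened, hypothetical excerpt of the directory listing.
const listing = [
  '<a href="20190823/">20190823/</a>',
  '<a href="20190826/">20190826/</a>',
  '<a href="desktop.ini">desktop.ini</a>',
].join('\n');

// Strip the literal href=" prefix from each full match to get the name.
const prefixLength = 'href="'.length;
const names = [...listing.matchAll(withoutIni)].map(m =>
  m[0].slice(prefixLength)
);

console.log(names); // the desktop.ini link produces no match at all
```

Because the i flag is kept, a name ending in .INI would be rejected as well. A name such as file.ini.bak would still match, since only a trailing .ini is excluded.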
Sample Input String
A sample block of HTML code that I test the regex with.
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html>
<head>
<title>Index of /images/AAVS</title>
</head>
<body>
<h1>Index of /images/AAVS</h1>
<table>
<tr><th valign="top"><img src="/icons/blank.gif" alt="[ICO]"></th><th><a href="?C=N;O=D">Name</a></th><th><a href="?C=M;O=A">Last modified</a></th><th><a href="?C=S;O=A">Size</a></th><th><a href="?C=D;O=A">Description</a></th></tr>
<tr><th colspan="5"><hr></th></tr>
<tr><td valign="top"><img src="/icons/back.gif" alt="[PARENTDIR]"></td><td><a href="/images/">Parent Directory</a> </td><td> </td><td align="right"> - </td><td> </td></tr>
<tr><td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td><td><a href="20190823/">20190823/</a> </td><td align="right">2019-09-19 19:37 </td><td align="right"> - </td><td> </td></tr>
<tr><td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td><td><a href="20190826/">20190826/</a> </td><td align="right">2019-09-19 19:31 </td><td align="right"> - </td><td> </td></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[ ]"></td><td><a href="desktop.ini">desktop.ini</a> </td><td align="right">2019-09-19 19:24 </td><td align="right">136 </td><td> </td></tr>
<tr><th colspan="5"><hr></th></tr>
</table>
</body></html>
Also, I would like to say that I am sure there is a much better approach. All critique is welcome!