RegEx for capturing src attribute value within escaped strings with specific file extensions

Question

I'm migrating a large set of articles from an old CMS to a new one. As part of this process, I need to capture any links pointing to local assets, send those assets to the new CMS, and then relink the articles with the new asset links. My problem is that I can't figure out a regex that captures the links properly. The article HTML is exported as one long string in a JSON document, and I'm currently trying to capture png, gif, jpg and pdf files. Because the attributes sit inside a JSON string, all values are wrapped in escaped quotation marks, like this: <a href=\"http://www.mysite.no/webdocs/mypdf.pdf\">

My latest attempt looks like this: string pattern = "(?<=\\\")([^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\\.(?:pdf|jpg|png|gif))(?=\\\")";

It worked for the most part, until I ran into images from the old CMS where the alt and title attributes were just the filename, including the file extension. Those values also matched the regex and caused errors.

I've also tried this: string pattern = @"(?:src|href)=[^\s]*\.(pdf|png|jpg|gif)"; Firstly, this captures the entire attribute rather than just its value. Secondly, it has on occasion captured the text inside the anchor tag as well, when there is no space between the link and the closing tag, like this: http://www.mysite.no/webdocs/mypdf.pdf">http://www.mysite.no/webdocs/mypdf.pdf

I've been having trouble with this for ages. Any help is greatly appreciated. To boil it down, I want a regex that captures the value of a src or href attribute that is wrapped in escaped quotation marks and ends in a pdf, png, jpg or gif extension.

My migration script is written in C#.
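
For reference, the requirement above can be sketched as a single pattern that anchors on the attribute name and captures only the value between the escaped quotes. This is a minimal sketch, not a tested migration tool. It assumes the input is the raw JSON text (so every quote appears as \"), that URLs contain no backslashes (for example, no `\/` escaping), and that a URL ends directly in the extension with no query string. The helper name `FindAssetUrls` and the `/img/logo.png` sample path are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: returns href/src values ending in one of the asset extensions.
static List<string> FindAssetUrls(string rawJson)
{
    // Verbatim string: "" is a literal quote, \\ matches one literal backslash.
    // (?<![\w-]) keeps names such as data-src from matching.
    const string pattern =
        @"(?<![\w-])(?:src|href)\s*=\s*\\""(?<url>[^""\\]+\.(?:pdf|png|jpg|gif))\\""";

    var urls = new List<string>();
    foreach (Match m in Regex.Matches(rawJson, pattern, RegexOptions.IgnoreCase))
    {
        // Only the named group is kept, so the attribute name and quotes are dropped.
        urls.Add(m.Groups["url"].Value);
    }
    return urls;
}

// Sample input as it would appear in the raw JSON text (quotes escaped as \").
string sample =
    "<a href=\\\"http://www.mysite.no/webdocs/mypdf.pdf\\\">http://www.mysite.no/webdocs/mypdf.pdf</a>" +
    "<img src=\\\"/img/logo.png\\\" alt=\\\"logo.png\\\" title=\\\"logo.png\\\">";

foreach (string url in FindAssetUrls(sample))
{
    Console.WriteLine(url);
}
```

Because the match must start at `src` or `href`, the alt and title values and the anchor text in the sample are skipped. Because the captured group excludes quotes and backslashes, the match cannot run past the closing \".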

Have a look at [this](https://stackoverflow.com/a/1732454/5174469) — Mong Zhu, May 31 '23 at 11:35
Use a proper XHTML or HTML parser like HtmlAgilityPack. You just need an XPath expression like `//a/@href` and the parser will pull out the correct attribute values for you. Regex is the wrong tool for this job. — Charlieface, May 31 '23 at 11:44
@Charlieface That achieved what I've been trying for the past 6 hours in 5 minutes. Thanks for the help! — Greger Gundersen, May 31 '23 at 11:56
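
The parser-based approach from the comments might look roughly like the sketch below. It is an illustration under assumptions, not the code the asker ended up with. It assumes the HtmlAgilityPack NuGet package is referenced and that the HTML string has already been deserialized from the JSON document, so the quotes are no longer escaped. The helper name `FindAssetUrls` is hypothetical. The sketch selects elements and reads their attributes with `GetAttributeValue` rather than selecting attribute nodes directly.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party NuGet package, assumed to be installed

// Hypothetical helper: collects href/src values that end in an asset extension.
static List<string> FindAssetUrls(string html)
{
    string[] extensions = { ".pdf", ".png", ".jpg", ".gif" };

    var doc = new HtmlDocument();
    doc.LoadHtml(html);

    var urls = new List<string>();

    // SelectNodes returns null (not an empty collection) when nothing matches.
    HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//*[@href or @src]");
    if (nodes == null)
    {
        return urls;
    }

    foreach (HtmlNode node in nodes)
    {
        foreach (string attributeName in new[] { "href", "src" })
        {
            // An empty default means a missing attribute simply fails the extension check.
            string value = node.GetAttributeValue(attributeName, "");
            if (Array.Exists(extensions,
                    ext => value.EndsWith(ext, StringComparison.OrdinalIgnoreCase)))
            {
                urls.Add(value);
            }
        }
    }
    return urls;
}
```

Since only the href and src attributes are read, alt and title values that happen to contain a filename are never considered, which sidesteps the false positives described in the question.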
