I'm migrating a large set of articles from an old CMS to a new one. As part of this process, I need to capture any links pointing to local assets, send those assets to the new CMS, and then relink the articles with the new asset links.
My problem is that I can't figure out a regex that captures the links properly. The article HTML is exported as one long string in a JSON document. I'm currently trying to capture links to png, gif, jpg and pdf files.
Because the attributes sit inside a JSON string, all attribute values are wrapped in escaped quotation marks, like this:
<a href=\"http://www.mysite.no/webdocs/mypdf.pdf\">
My latest attempt looks like this:
string pattern = "(?<=\\\")([^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\\.(?:pdf|jpg|png|gif))(?=\\\")";
It worked for the most part, until I ran into images from the old CMS where the alt and title attributes were just the filename, including the file extension. Those attribute values also matched the regex and caused errors.
I've also tried this:
string pattern = @"(?:src|href)=[^\s]*\.(pdf|png|jpg|gif)";
This has two problems. First, it captures the entire attribute, not just the value. Second, it sometimes captures text inside the anchor tag as well, when there is no space between the link and the closing tag, like this:
http://www.mysite.no/webdocs/mypdf.pdf">http://www.mysite.no/webdocs/mypdf.pdf
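To make that second failure reproducible, here is a minimal sketch. The class name OverMatchRepro, the Sample string and the FirstMatch helper are made up for this post, and the sample assumes the anchor text is the same URL as the href with no whitespace in between, which is the case where I see the problem:

```csharp
using System;
using System.Text.RegularExpressions;

public static class OverMatchRepro
{
    // The second pattern from above (each extension listed once).
    public const string Pattern = @"(?:src|href)=[^\s]*\.(pdf|png|jpg|gif)";

    // Hypothetical sample: one anchor as it appears in the raw JSON string,
    // i.e. with backslash-escaped quotes around the attribute value.
    public const string Sample =
        @"<a href=\""http://www.mysite.no/webdocs/mypdf.pdf\"">http://www.mysite.no/webdocs/mypdf.pdf</a>";

    // Returns the full text of the first match.
    public static string FirstMatch() => Regex.Match(Sample, Pattern).Value;

    public static void Main()
    {
        // [^\s]* is greedy and nothing stops it at the closing quote, so the
        // match runs from "href=" through the second ".pdf" in the link text.
        Console.WriteLine(FirstMatch());
    }
}
```

Since there is no whitespace between the attribute and the link text, the greedy [^\s]* runs past the closing quote and only backtracks as far as the last file extension it can find.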
I've been having trouble with this for ages, so any help is greatly appreciated. To boil it down: I want a regex that captures only the value of an src or href attribute that is wrapped in escaped quotation marks and ends in a pdf, png, jpg or gif extension.
My migration script is written in C#.