Finding content within HTML, and replacing it

Question

I am currently in the process of exporting content from one CMS and importing it into another.

I have the export in place. I am exporting all content from the old CMS to an XML file, keeping the structure of the documents etc. The import is also in place, mapping to new PageTypes, mapping text fields etc. I have also exported and imported all media from the old to the new CMS.

My only remaining concern is handling internal links, and links to media items, within the RichText field of each page.

So, each page consists of a Header, some generic info, and a RichTextField containing the page content in HTML. This field can contain links to other pages within the same site (internal links) and links to media items.

My question is: how can I find these and map them to my new structure?

All internal links look like this: <a href="/mycms/~/link.aspx?_id=D9423CEFED254610A5DC6B096A297E17&_z=z">...</a> (some links may carry additional attributes, such as style="..", class=".." etc.). The ID is a reference to the ID in the old CMS, and it is always 32 characters long.

The media items (images) could look like this: <img src="/mycms/~/media/B1FB91AC357347BD84913D56B8791D03.ashx" alt="" width="690" height="202" />. Here too, the ID is always 32 characters long.

During the import, I generated a JSON file containing all media IDs from the old CMS, each mapped to the new ID in the new CMS. It looks like this:

{
    "{0CFBBD0A-9156-4AD9-8A8A-7D30B2D7213B}":1095,
    "{BE9BEAAA-F04D-42DA-B52A-44B4B31A389E}":1096,
    etc.
}

Notice that the format of the old CMS ID in this file differs from the one used in the links and media. Strip the curly braces and dashes, and it will match.

What would be the best way to go about this? I am guessing a RegEx would be the way to go - but what would/could that look like?

Thanks :)

[Don't use regex to parse html](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). — Erik Philips, Jan 25 '18 at 19:05

nikib3ro · Answer 1 · 2018-01-25T23:19:06.380

Your best bet would be using something like HtmlAgilityPack. Pure Regex is usually too crude to parse HTML reliably. It is not an impossible task, but it is much harder than using HtmlAgilityPack.

The post Erik linked in his comment is an infamous one in StackOverflow history, and multiple replies there go into more detail on why parsing HTML with Regex is not a recommended approach. To provide a TL;DR from my personal experience: HTML pages are often full of small "errors". For example, you'll often have <img> tags that are not self-closed (written as <img> rather than <img />). Deterministic matching and replacing is also quite hard.

So, try to use the right tool for the job. In this case the right tool is HtmlAgilityPack.

When it comes to using HtmlAgilityPack, they have good documentation. In your case you'll likely want to take a look at the ReplaceChild functionality. To reproduce the example from their docs, here is the test HTML used:

<body>
    <h1>This is <b>bold</b> heading</h1>
    <p>This is <u>underlined</u> paragraph</p>
</body>

To manipulate this and replace the <h1> node, you would do:

using HtmlAgilityPack;

var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html); // where html = @"content previously mentioned"

var htmlBody = htmlDoc.DocumentNode.SelectSingleNode("//body");
// ChildNodes[0] is the whitespace text node after <body>, so the <h1> is at index 1
HtmlNode oldChild = htmlBody.ChildNodes[1];
HtmlNode newChild = HtmlNode.CreateNode("<h2> This is h2 new child heading</h2>");

htmlBody.ReplaceChild(newChild, oldChild);
// now htmlBody has <h2> node instead of old <h1>

In your case you'll likely want to use SelectNodes instead of SelectSingleNode, with an XPath expression that targets the elements you want to replace. Once you have those elements in a list, iterate over them and replace content depending on your conditions.
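
For the links and images described in the question, the SelectNodes approach might look roughly like the sketch below. It is not taken from the HtmlAgilityPack docs. The names RichTextRewriter, Rewrite, idMap, buildPageUrl and buildMediaUrl are hypothetical, idMap is assumed to be keyed by the 32-character upper-case form of the old ID, and the shape of the new URLs is left to the caller because the question does not say what the new CMS expects.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // NuGet package, not part of the base class library

public static class RichTextRewriter
{
    // The two URL shapes from the question; group 1 captures the 32-character ID.
    private static readonly Regex LinkId =
        new Regex(@"/link\.aspx\?_id=([0-9A-Fa-f]{32})");
    private static readonly Regex MediaId =
        new Regex(@"/media/([0-9A-Fa-f]{32})\.ashx");

    public static string Rewrite(
        string html,
        IReadOnlyDictionary<string, int> idMap,   // assumption: normalized old ID -> new ID
        Func<int, string> buildPageUrl,           // assumption: caller knows the new URL format
        Func<int, string> buildMediaUrl)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        RewriteAttribute(doc, "//a[@href]", "href", LinkId, idMap, buildPageUrl);
        RewriteAttribute(doc, "//img[@src]", "src", MediaId, idMap, buildMediaUrl);

        return doc.DocumentNode.OuterHtml;
    }

    private static void RewriteAttribute(
        HtmlDocument doc, string xpath, string attribute, Regex idPattern,
        IReadOnlyDictionary<string, int> idMap, Func<int, string> buildUrl)
    {
        // SelectNodes returns null, not an empty collection, when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null) return;

        foreach (var node in nodes)
        {
            var match = idPattern.Match(node.GetAttributeValue(attribute, ""));
            if (!match.Success) continue; // external link or unrelated image

            var oldId = match.Groups[1].Value.ToUpperInvariant();
            if (idMap.TryGetValue(oldId, out var newId))
                node.SetAttributeValue(attribute, buildUrl(newId));
        }
    }
}
```

Only the attribute value is inspected with a small regex here; finding the tags and attributes is left to the parser, so extra attributes such as style or class do not matter. IDs that are missing from the map are left untouched, which makes them easy to find afterwards.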

One thing to keep in mind: since your IDs are quite verbose at 32 chars, you can likely match them reliably with a plain string search. So if you are NOT targeting certain HTML elements, but rather the IDs themselves, then you don't even need HtmlAgilityPack or Regex; a simple String.Replace("OLDUID", "NEWUID") will do.
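
A minimal sketch of that plain string replacement, assuming the JSON mapping from the question has already been loaded into a dictionary. IdReplacer, Normalize and ReplaceAll are made-up names, and the sketch assumes the IDs appear in upper case in the HTML, as they do in the question's examples.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class IdReplacer
{
    // "{0CFBBD0A-9156-4AD9-8A8A-7D30B2D7213B}" -> "0CFBBD0A91564AD98A8A7D30B2D7213B"
    // Guid.Parse accepts the braced form; the "N" format drops braces and dashes.
    public static string Normalize(string bracedId) =>
        Guid.Parse(bracedId).ToString("N").ToUpperInvariant();

    // oldToNew uses the braced keys exactly as they appear in the JSON file.
    public static string ReplaceAll(string html, IReadOnlyDictionary<string, int> oldToNew)
    {
        foreach (var pair in oldToNew)
        {
            // Ordinal, case-sensitive replacement of every occurrence of the old ID.
            html = html.Replace(
                Normalize(pair.Key),
                pair.Value.ToString(CultureInfo.InvariantCulture));
        }
        return html;
    }
}
```

Note that this swaps only the ID itself, so the rest of the old URL (for example /mycms/~/media/ and .ashx) stays in place. It also scans the HTML once per dictionary entry, which may be slow for a large mapping.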

I strongly agree with kape123 on the use of the HtmlAgilityPack in this scenario instead of brute force Regex. I just did a migration project for a client with a similar scenario - old CMS into Sitecore - and I had to do a lot of HTML clean-up and manipulation when importing the content into Sitecore including properly setting internal and media links. I don't think I could have accomplished the task without the HtmlAgilityPack. — DougCouto, Jan 25 '18 at 19:30
Thanks for the suggestion @kape123. Any ideas as to how I, with AgilityPack, can find the links and images, that matches my pattern, with the 32 characters ID? — brother, Jan 25 '18 at 20:38

score 0 · Answer 2 · 2018-01-28T19:26:30.423

If you are mixing non-html with html, it's best to use regex.
Here is a way to do the substitutions.

Links:

(?i)(<a)(?=((?:[^>"']|"[^"]*"|'[^']*')*?\shref\s*=\s*(['"])/mycms/~/link\.aspx\?_id=)([a-f0-9]{32})(&_z=z\3(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+>))\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+>

Replace with $1$2 + key{$4} + $5
where key{$4} is the new link ID value from the dictionary.

https://regex101.com/r/xRf1xN/1

 # https://regex101.com/r/ieEBj8/1

 (?i)                              # Case insensitive modifier
 ( < a )                           # (1), The a tag

 (?=                               # Assertion (a pseudo atomic group)

      (                                 # (2 start), Up to the ID num
           (?: [^>"'] | " [^"]* " | ' [^']* ' )*?

           \s href \s* = \s*                 # href attribute
           ( ['"] )                          # (3), Quote
           /mycms/~/link\.aspx\?_id=         # Prefix link static text
      )                                 # (2 end)

      ( [a-f0-9]{32} )                  # (4), hex link ID

      (                                 # (5 start), All past the ID num
           &_z=z                             # Postfix link static text
           \3                                # End quote

                                             # The remainder of the tag parts
           (?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]*? )+
           > 
      )                                 # (5 end)

 )
                                   # All the parts have already been found via assertion
                                   # Just match a normal tag closure to advance the position
 \s+                               
 (?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]*? )+
 >

Media:

(?i)(<img)(?=((?:[^>"']|"[^"]*"|'[^']*')*?\ssrc\s*=\s*(['"])/mycms/~/media/)([a-f0-9]{32})(\.ashx\3(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+>))\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+>

Replace with $1$2 + key{$4} + $5
where key{$4} is the new media ID value from the dictionary.

https://regex101.com/r/pwyjoK/1
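
In .NET, the "$1$2 + key{$4} + $5" substitution cannot be written as a plain replacement string, because the middle part is a dictionary lookup. One way to do it is a MatchEvaluator callback, sketched here with the media pattern above. MediaIdRegexRewriter, Rewrite and newIds are hypothetical names, and newIds is assumed to be keyed by the 32-character upper-case form of the old ID.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class MediaIdRegexRewriter
{
    // The media pattern from this answer, unchanged.
    // Double quotes are doubled because this is a C# verbatim string.
    private static readonly Regex MediaTag = new Regex(
        @"(?i)(<img)(?=((?:[^>""']|""[^""]*""|'[^']*')*?\ssrc\s*=\s*(['""])/mycms/~/media/)([a-f0-9]{32})(\.ashx\3(?:""[\S\s]*?""|'[\S\s]*?'|[^>]*?)+>))\s+(?:""[\S\s]*?""|'[\S\s]*?'|[^>]*?)+>");

    public static string Rewrite(string html, IReadOnlyDictionary<string, int> newIds)
    {
        return MediaTag.Replace(html, match =>
        {
            var oldId = match.Groups[4].Value.ToUpperInvariant();
            if (!newIds.TryGetValue(oldId, out var newId))
                return match.Value; // unknown ID: leave the tag untouched

            // $1$2 + key{$4} + $5
            return match.Groups[1].Value
                 + match.Groups[2].Value
                 + newId.ToString(CultureInfo.InvariantCulture)
                 + match.Groups[5].Value;
        });
    }
}
```

The link pattern can be used the same way. Depending on how the RichText field is stored, the ampersand in the link URL may appear as & or as &amp;, so it may be worth checking a few real pages before running the link replacement.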

 # https://regex101.com/r/ieEBj8/1

 (?i)                              # Case insensitive modifier
 ( < img )                         # (1), The img tag

 (?=                               # Assertion (a pseudo atomic group)

      (                                 # (2 start), Up to the ID num
           (?: [^>"'] | " [^"]* " | ' [^']* ' )*?

           \s src \s* = \s*                  # src attribute
           ( ['"] )                          # (3), Quote
           /mycms/~/media/                   # Prefix media static text
      )                                 # (2 end)

      ( [a-f0-9]{32} )                  # (4), hex media ID

      (                                 # (5 start), All past the ID num
           \.ashx                            # Postfix media static text
           \3                                # End quote

                                             # The remainder of the tag parts
           (?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]*? )+
           > 
      )                                 # (5 end)

 )
                                   # All the parts have already been found via assertion
                                   # Just match a normal tag closure to advance the position
 \s+                               
 (?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]*? )+
 >

If I wanted to a) extract the ID within the link/src tag and b) replace the entire href=".." or src=".." value (and not just the ID part), how would that look in RegEx?

To do this, just rearrange the capture groups.

Links:

(?i)(<a)(?=((?:[^>"']|"[^"]*"|'[^']*')*?\s)(href\s*=\s*(['"])/mycms/~/link\.aspx\?_id=([a-f0-9]{32})&_z=z\4)((?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+>))\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+>

Replace with $1$2href='NEWID:key{$5}'$6
where key{$5} is the new link ID value from the dictionary.

https://regex101.com/r/FxpJVl/1

 (?i)                              # Case insensitive modifier
 ( < a )                           # (1), The a tag

 (?=                               # Assertion (a pseudo atomic group)

      (                                 # (2 start), Up to the href attribute
           (?: [^>"'] | " [^"]* " | ' [^']* ' )*?
           \s 
      )                                 # (2 end)
      (                                 # (3 start), href attribute
           href \s* = \s* 
           ( ['"] )                          # (4), Quote
           /mycms/~/link\.aspx\?_id=         # Prefix link static text


           ( [a-f0-9]{32} )                  # (5), hex link ID


           &_z=z                             # Postfix link static text
           \4                                # End quote
      )                                 # (3 end)
      (                                 # (6 start), remainder of the tag parts

           (?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]*? )+
           > 
      )                                 # (6 end)

 )
                                   # All the parts have already been found via assertion
                                   # Just match a normal tag closure to advance the position
 \s+                               
 (?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]*? )+
 >

Media:

(?i)(<img)(?=((?:[^>"']|"[^"]*"|'[^']*')*?\s)(src\s*=\s*(['"])/mycms/~/media/([a-f0-9]{32})\.ashx\4)((?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+>))\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+>

Replace with $1$2src='NEWID:key{$5}'$6
where key{$5} is the new media ID value from the dictionary.

https://regex101.com/r/EqKYjM/1

 (?i)                              # Case insensitive modifier
 ( < img )                         # (1), The img tag

 (?=                               # Assertion (a pseudo atomic group)

      (                                 # (2 start), Up to the src attribute
           (?: [^>"'] | " [^"]* " | ' [^']* ' )*?
           \s 
      )                                 # (2 end)
      (                                 # (3 start), src attribute
           src \s* = \s* 
           ( ['"] )                          # (4), Quote
           /mycms/~/media/                   # Prefix media static text

           ( [a-f0-9]{32} )                  # (5), hex media ID

           \.ashx                            # Postfix media static text
           \4                                # End quote
      )                                 # (3 end)
      (                                 # (6 start), remainder of the tag parts

           (?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]*? )+
           > 
      )                                 # (6 end)

 )
                                   # All the parts have already been found via assertion
                                   # Just match a normal tag closure to advance the position
 \s+                               
 (?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]*? )+
 >

Looks great - thank you! If I wanted to a) extract the ID within the link/src tag and b) replace the entire href=".." or src=".." value (and not just the ID part), how would that look in RegEx? — brother, Jan 26 '18 at 06:47
