
I have the following string:

"<a href=\"/formentries/formfile/13978\" target=\"_blank\">dog-00.jpg|image/jpeg</a>  <a href='/FormEntries/Delete' class='btnDeleteAttachment'  data-form-entry-id='366793'  data-attachment-id='13978'> [remove]</a><br /><a href=\"/formentries/formfile/13979\" target=\"_blank\">dog-01.docx|application/vnd.openxmlformats-officedocument.wordprocessingml.document</a>  <a href='/FormEntries/Delete' class='btnDeleteAttachment'  data-form-entry-id='366793'  data-attachment-id='13979'> [remove]</a><br /><a href=\"/formentries/formfile/13980\" target=\"_blank\">dog-02.png|image/png</a>  <a href='/FormEntries/Delete' class='btnDeleteAttachment'  data-form-entry-id='366793'  data-attachment-id='13980'> [remove]</a>"

Formatted nicely, it looks something like this:

<a href=\"/formentries/formfile/13978\" target=\"_blank\">dog-00.jpg|image/jpeg</a>
<a href='/FormEntries/Delete' class='btnDeleteAttachment'  data-form-entry-id='366793'  data-attachment-id='13978'> [remove]</a>
<br />

<a href=\"/formentries/formfile/13979\" target=\"_blank\">dog-01.docx|application/vnd.openxmlformats-officedocument.wordprocessingml.document</a>
<a href='/FormEntries/Delete' class='btnDeleteAttachment'  data-form-entry-id='366793'  data-attachment-id='13979'> [remove]</a>
<br />

<a href=\"/formentries/formfile/13980\" target=\"_blank\">dog-02.png|image/png</a>
<a href='/FormEntries/Delete' class='btnDeleteAttachment'  data-form-entry-id='366793'  data-attachment-id='13980'> [remove]</a>

So I have a bunch of anchor tags with breaks between them. In each anchor's text, I want to remove the pipe character and the file type:

dog-00.jpg|image/jpeg

becomes

dog-00.jpg

And the regex ought to work for all future file types too, for example:

dog-01.docx|application/vnd.openxmlformats-officedocument.wordprocessingml.document

becomes

dog-01.docx

I still need the full anchors, so after removing the file type, the text becomes:

<a href=\"/formentries/formfile/13978\" target=\"_blank\">dog-00.jpg</a>
<a href='/FormEntries/Delete' class='btnDeleteAttachment'  data-form-entry-id='366793'  data-attachment-id='13978'> [remove]</a>
<br />

<a href=\"/formentries/formfile/13979\" target=\"_blank\">dog-01.docx</a>
<a href='/FormEntries/Delete' class='btnDeleteAttachment'  data-form-entry-id='366793'  data-attachment-id='13979'> [remove]</a>
<br />

I am not very good at regex. I tried various patterns, but none of them matched.

J86

3 Answers


Don't use regex to parse complex HTML; use HtmlAgilityPack instead. I'd also use string methods like Contains, IndexOf and Remove rather than regex:

// requires: using HtmlAgilityPack;
var doc = new HtmlDocument();
doc.LoadHtml(html); // pass in your HTML string

// SelectNodes returns null when nothing matches, so guard before iterating
var links = doc.DocumentNode.SelectNodes("//a[@href]");
if (links != null)
{
    foreach (HtmlNode link in links)
    {
        string text = link.InnerText;
        if (text.Contains('|'))
            link.InnerHtml = text.Remove(text.IndexOf('|')); // you can't modify InnerText directly but this works
    }
}

string result = doc.DocumentNode.OuterHtml; // your desired result
Tim Schmelter

Input:
dog-00.jpg|image/jpeg

Regex that matches only the part before the | pipe:
([^|]+)

Description:
The regex above matches everything up to the first pipe character.

C# code:

var input = @"dog-00.jpg|image/jpeg";
var regex = new Regex(@"([^|]+)");
var m = regex.Match(input);
string name = null;
if (m.Success)
{
     name = m.Groups[1].Value;
}

EDIT:
If this is only about splitting the string at the pipe character, Dylan Nicholson's variant with input.Split (or .Substring + .IndexOf) might be more performant than regular expressions.
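
The Substring + IndexOf idea could look roughly like the sketch below. The class name PipeDemo and the helper StripAfterPipe are hypothetical names chosen for illustration, and the helper is assumed to receive a single link text (not the whole HTML string).

```csharp
using System;

public static class PipeDemo
{
    // Hypothetical helper: returns the part of 'text' before the first '|',
    // or the text unchanged when it contains no pipe at all.
    public static string StripAfterPipe(string text)
    {
        int pipe = text.IndexOf('|');
        return pipe < 0 ? text : text.Substring(0, pipe);
    }

    public static void Main()
    {
        Console.WriteLine(StripAfterPipe("dog-00.jpg|image/jpeg")); // dog-00.jpg
        Console.WriteLine(StripAfterPipe("dog-02.png"));            // dog-02.png
    }
}
```

Checking the index before calling Substring matters here, because IndexOf returns -1 for texts without a pipe (such as " [remove]").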

EDIT2:
Are regular expressions required? If not, try the following:

public static string Clean(string input)
{
    var sb = new StringBuilder(input);
    int m1 = -1, m2 = -1;
    for(var i = 0; i < sb.Length; i++)
    {
        if (sb[i] == '|')
            m1 = i;
        if (sb[i] == '<')
            m2 = i;
        if (m1 > -1 && m2 > -1 && m2 > m1)
        {
            sb.Remove(m1, m2 - m1);
            i = m1;
            m1 = -1;
            m2 = -1;
        }
    }
    return sb.ToString();
}
Michael
  • Thanks @Michael, I've updated the question. I wish my string to remain the same (e.g. all the anchor tags ..etc), I only want the pipe and the file type that comes after that to be removed. – J86 Dec 19 '17 at 11:08
  • @Ciwan Updated my answer with a non-regex-variant that keeps the html code untouched... – Michael Dec 19 '17 at 11:16

Updated

You can use this regex:

(?<=<a[^>]*>[^|]+?)\|.*?(?=</a>)

For C#:

 your_string = Regex.Replace(your_string, "(?<=<a[^>]*>[^|]+?)\\|.*?(?=</a>)", "",
    RegexOptions.IgnoreCase | RegexOptions.Multiline);

Just replace the string using this regex.
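
As a rough, self-contained check, the sketch below applies the same pattern and options to a shortened version of the string from the question. The class name RegexDemo, the method RemoveFileTypes, and the abbreviated sample input are assumptions made for illustration only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexDemo
{
    // Same pattern as above: the lookbehind requires an opening <a ...> tag
    // followed by non-pipe text, and the lookahead stops the match before </a>.
    private const string Pattern = @"(?<=<a[^>]*>[^|]+?)\|.*?(?=</a>)";

    // Hypothetical wrapper: removes "|<file type>" from every anchor text.
    public static string RemoveFileTypes(string html)
    {
        return Regex.Replace(html, Pattern, "",
            RegexOptions.IgnoreCase | RegexOptions.Multiline);
    }

    public static void Main()
    {
        // Shortened sample: two file links with a [remove] link between them.
        string html =
            "<a href=\"/formentries/formfile/13978\" target=\"_blank\">dog-00.jpg|image/jpeg</a>" +
            "  <a href='/FormEntries/Delete' class='btnDeleteAttachment'> [remove]</a><br />" +
            "<a href=\"/formentries/formfile/13980\" target=\"_blank\">dog-02.png|image/png</a>";

        // Both anchors keep their tags; only the pipe and file type are dropped.
        Console.WriteLine(RemoveFileTypes(html));
    }
}
```

Regex.Replace substitutes every match in the input, so each file link in the string is cleaned in a single call while the [remove] links, which contain no pipe, are left as they are.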

karthik selvaraj
  • Updated the question. I do not just want the inside part of the anchor. I want to remove the pipe character and the file type, but leave everything else in the string. – J86 Dec 19 '17 at 11:05
  • @TimSchmelter - regex approach is enough i think. This is one line solution. – karthik selvaraj Dec 19 '17 at 12:23
  • @karthikselvaraj: one line that doesn't work always is not enough. Parsing dynamic html with regex is not very reliable. You know [this](https://stackoverflow.com/a/1732454/28424)? – Tim Schmelter Dec 19 '17 at 12:28
  • @karthikselvaraj your code seems to only return 1 anchor instead of 6 – J86 Dec 19 '17 at 12:31
  • @Ciwan - I have tested the code. please check - https://ideone.com/6GoqCt – karthik selvaraj Dec 19 '17 at 12:55