I can't alter the program, so I have no other option: I need a way to programmatically remove the percent-sign garbage formatting that exists in a line of text.

The query will return a string like this:

'%3CSPAN style='FONT-SIZE: 12pt; FONT-FAMILY: %22Times New Roman%22,%22serif%22; mso-fareast-font-family: %22Times New Roman%22; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA'%3E%3CFONT color=#000000%3E3/20/18: Mrs. McDoogal completed a medical assessment with Dr. John Zoidberg, MD, at Futurama on 4/6/15 and he completed a new substance assessment on 4/14/18.%3C/FONT%3E%3CSPAN style=%22mso-spacerun: yes%22%3E%3CFONT color=#000000%3E  %3C/FONT%3E%3C/SPAN%3E%3CFONT color=#000000%3EMrs. McDoogal is diagnosed with Foobar I diagnosis of Groovy Mind, Foo; Cartoon Dependence; and Fiddling Disorder. %3C/FONT%3E%3CSPAN style=%22mso-spacerun: yes%22%3E%3CFONT color=#000000%3E %3C/FONT%3E%3C/SPAN%3E%3CFONT color=#000000%3EMr. McDoogal is prescribed DDT 30 mg. and LSD 150 mg ABC.%3C/FONT%3E%3CSPAN style=%22mso-spacerun: yes%22%3E%3CFONT color=#000000%3E  %3C/FONT%3E%3C/SPAN%3E%3CFONT color=#000000%3EMr. McDoogal will be enrolled in the day treatment program at Futurama.%3C/FONT%3E%3CSPAN style=%22mso-spacerun: yes%22%3E%3CFONT color=#000000%3E  %3C/FONT%3E%3C/SPAN%3E%3C/SPAN%3E'

I want to strip out stuff like this:

.%3C/FONT%3E%3CSPAN style=%22mso-spacerun: yes%22%3E%3CFONT color=#000000%3E  %3C/FONT%3E%3C/SPAN%3E%3C/SPAN%3E

What is the name of this stuff I want to strip out?

Mojoala
  • It's not "garbage" it's likely HTML or URL-encoded character values. – rory.ap Jan 12 '16 at 21:31
  • That looks like a URL encoded string. Check out: http://stackoverflow.com/questions/3778165/unescape-javascripts-escape-using-c-sharp – Daved Jan 12 '16 at 21:39
  • mso tag indicates it's copied directly from a Microsoft Office product, so URL decoding it properly won't be enough, but you'll likely need to do a cleaning function as well, properly using RegEx – Allan S. Hansen Jan 13 '16 at 11:47
  • Daved, the link does not work. – Mojoala Jan 13 '16 at 13:26

1 Answer

If you do a manual search-and-replace on your sample data using the following values, you end up with an HTML fragment.

Value    Character
%3C      <
%3E      >
%22      "

Making these substitutions produces the following markup. It is formatted here for readability; the actual result will be a single line if the original contains no line terminators.

<SPAN style='FONT-SIZE: 12pt; FONT-FAMILY: "Times New Roman","serif"; mso-fareast-font-family: "Times New Roman"; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA'>
    <FONT color=#000000>3/20/18: Mrs. McDoogal completed a medical assessment with Dr. John Zoidberg, MD, at Futurama on 4/6/15 and he completed a new substance assessment on 4/14/18.</FONT>
    <SPAN style="mso-spacerun: yes">
        <FONT color=#000000>  </FONT>
    </SPAN>
    <FONT color=#000000>Mrs. McDoogal is diagnosed with Foobar I diagnosis of Groovy Mind, Foo; Cartoon Dependence; and Fiddling Disorder. </FONT>
    <SPAN style="mso-spacerun: yes">
        <FONT color=#000000> </FONT>
    </SPAN>
    <FONT color=#000000>Mr. McDoogal is prescribed DDT 30 mg. and LSD 150 mg ABC.</FONT>
    <SPAN style="mso-spacerun: yes">
        <FONT color=#000000>  </FONT>
    </SPAN>
    <FONT color=#000000>Mr. McDoogal will be enrolled in the day treatment program at Futurama.</FONT>
    <SPAN style="mso-spacerun: yes">
        <FONT color=#000000>  </FONT>
    </SPAN>
</SPAN>

You can do this in C# using the String.Replace method:

// Replaces the three percent-encoded sequences found in the sample data.
public static string ReplaceGarbage(string garbageString)
{
    return garbageString.Replace(@"%3C", @"<")
                        .Replace(@"%3E", @">")
                        .Replace(@"%22", @"""");
}
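The three sequences above are ordinary URL percent-encoding, so rather than listing each one by hand, the whole string can be decoded in one call. The following is a minimal sketch, assuming the data contains only standard %XX escapes; the class and method names (PercentDecoder, Decode) are hypothetical and not part of the original answer. It uses Uri.UnescapeDataString rather than WebUtility.UrlDecode because the latter would also turn any literal '+' into a space.

```csharp
using System;

public static class PercentDecoder
{
    // Hypothetical helper: decodes every %XX escape in a single pass,
    // so sequences beyond %3C, %3E and %22 are handled as well.
    public static string Decode(string encoded)
    {
        return Uri.UnescapeDataString(encoded);
    }
}
```

For example, PercentDecoder.Decode("%3CFONT color=#000000%3E") returns <FONT color=#000000>.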

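Then it should be a relatively easy job to remove the tags (if that's what you need), leaving just the body text.

// Requires: using System.Text.RegularExpressions;
public static string StripTagsRegex(string source)
{
    // Lazily match everything from '<' to the next '>' and remove it.
    return Regex.Replace(source, "<.*?>", string.Empty);
}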
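Putting the two steps together might look like the sketch below: decode the percent escapes, strip the tags, then collapse the whitespace runs left behind by the empty mso-spacerun spans. The names NoteCleaner and ToPlainText are hypothetical, and the tag pattern assumes no attribute value contains a '>' character. A regular expression is not a real HTML parser, so treat this as a starting point for well-behaved input like the sample, not a general solution.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NoteCleaner
{
    // Hypothetical helper: percent-decode, remove tags, tidy whitespace.
    public static string ToPlainText(string encoded)
    {
        // Step 1: turn %3C, %3E, %22 and any other %XX escape back into characters.
        string html = Uri.UnescapeDataString(encoded);

        // Step 2: drop anything that looks like a tag. [^>]* also matches
        // line breaks, unlike ".*?" without RegexOptions.Singleline.
        string text = Regex.Replace(html, "<[^>]*>", string.Empty);

        // Step 3: collapse runs of whitespace and trim the ends.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}
```

Called on a fragment such as "%3CFONT color=#000000%3EHello.%3C/FONT%3E%3CSPAN style=%22mso-spacerun: yes%22%3E%3CFONT color=#000000%3E  %3C/FONT%3E%3C/SPAN%3E%3CFONT color=#000000%3EWorld.%3C/FONT%3E", this returns "Hello. World.".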
Evil Dog Pie
  • Thank you, that was the exact solution I was looking for. – Mojoala Jan 13 '16 at 15:37
  • @Mojoala Welcome to C#. The code in my answer is far from perfect, but I'm pleased it helped and, hopefully, has given you a starting point to expand from. – Evil Dog Pie Jan 13 '16 at 15:50