4

I want to query a string (HTML) from a database and display it on a webpage. The problem is that the data has a <p> around the text (ending with </p>).

I want to strip this outer tag in my view model or in the controller action that returns this data. What is the best way of doing this in C#?

CRABOLO
leora

6 Answers

9

Might be overkill for your needs, but if you want to parse the HTML you can use the HtmlAgilityPack - certainly a cleaner solution in general than most suggested here, although it might not be as performant:

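using HtmlAgilityPack; // from the HtmlAgilityPack NuGet package
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml("<p> around the text (ending with </p>");
// The first child of the document node is the outer <p>; InnerHtml is its content.
string result = doc.DocumentNode.FirstChild.InnerHtml;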
BrokenGlass
4

If you're absolutely sure the string will always have that tag, you can use String.Substring like myString.Substring(3, myString.Length-7) or so.

A more robust method would be to either manually code the appropriate tests or use a regular expression, or ultimately, use an HTML parser as suggested by BrokenGlass's answer.
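
As a rough sketch of what "manually coding the appropriate tests" could look like: the helper below (the name StripOuterParagraph is made up for illustration) only strips when the trimmed string really starts with <p> and ends with </p>, and otherwise returns the input unchanged. It assumes the wrapper has no attributes, and it does not check that the two tags belong to the same paragraph, so "<p>a</p><p>b</p>" would come back as "a</p><p>b".

```csharp
using System;

static class ParagraphStripper
{
    // Hypothetical helper: removes one outer <p>...</p> pair if present,
    // otherwise returns the input unchanged.
    public static string StripOuterParagraph(string html)
    {
        const string open = "<p>";
        const string close = "</p>";

        if (string.IsNullOrEmpty(html)) return html;

        string trimmed = html.Trim();
        bool wrapped = trimmed.Length >= open.Length + close.Length
            && trimmed.StartsWith(open, StringComparison.OrdinalIgnoreCase)
            && trimmed.EndsWith(close, StringComparison.OrdinalIgnoreCase);

        // The content length is the total length minus both tags (3 + 4 = 7 characters).
        return wrapped
            ? trimmed.Substring(open.Length, trimmed.Length - open.Length - close.Length)
            : html;
    }

    static void Main()
    {
        Console.WriteLine(StripOuterParagraph("<p>Hello <b>world</b></p>")); // Hello <b>world</b>
        Console.WriteLine(StripOuterParagraph("no wrapper"));                // no wrapper
    }
}
```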

UPDATE: Using regexes you could do:

String filteredString = Regex.Match(myString, "^<p>(.*)</p>").Groups[1].Value;

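Note that the text between the tags is in Groups[1]; calling ToString() on the Match itself returns the whole match, tags included. You could add \s* after the initial ^ to also allow leading whitespace. Also, you can check the Success property of the Match to see if the string matched the <p>...</p> pattern at all. This may also help.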
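
Putting those pieces together, a minimal sketch (the class and method names are invented for illustration) might look like the following. It assumes a bare <p> wrapper with no attributes, tolerates surrounding whitespace and upper-case tags, and falls back to the original string when the pattern does not match.

```csharp
using System;
using System.Text.RegularExpressions;

static class RegexStripper
{
    // Optional whitespace, an opening <p>, the content, a closing </p>, optional whitespace.
    // Singleline lets '.' match newlines inside the paragraph.
    private static readonly Regex OuterParagraph = new Regex(
        @"^\s*<p>(.*)</p>\s*$",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Hypothetical helper: returns the content of the outer paragraph,
    // or the input unchanged if it is not wrapped in <p>...</p>.
    public static string StripOuterParagraph(string html)
    {
        Match m = OuterParagraph.Match(html);
        return m.Success ? m.Groups[1].Value : html;
    }

    static void Main()
    {
        Console.WriteLine(StripOuterParagraph("  <p>Hi</p>\n"));        // Hi
        Console.WriteLine(StripOuterParagraph("<p class='x'>Hi</p>")); // <p class='x'>Hi</p>
    }
}
```

Anything beyond this shape (attributes, a space inside the closing tag, nested markup you need to reason about) is where a parser becomes the safer choice.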

Community
sinelaw
  • 3
    Reminds me of a dev joke: You have a problem -> you think 'regular expressions' -> now you have two problems ;-) – Jakub Konecki Jan 30 '11 at 22:54
  • Doesn't C#'s Substring() support negative lengths? – ThiefMaster Jan 30 '11 at 22:56
  • @ThiefMaster: Library classes/functions don't differ across languages. Length must be > 0. http://msdn.microsoft.com/en-us/library/aka44szs.aspx – Adam Robinson Jan 30 '11 at 22:59
  • @sinelaw: Come on, why'd you have to start using RegEx? -1 to an otherwise good answer... – Adam Robinson Jan 30 '11 at 22:59
  • ThiefMaster, the second parameter is a length, not an index. – sinelaw Jan 30 '11 at 22:59
  • Adam Robinson: Care to explain why showing how to do it in a more robust way using RegEx is so bad? – sinelaw Jan 30 '11 at 23:00
  • I know it's a length. But for example in PHP you can pass a negative value to make the length count from the end of the string, which is pretty useful. – ThiefMaster Jan 30 '11 at 23:02
  • @sinelaw: RegEx+HTML=Failure. Seriously, though, the subject's been pretty well beaten to death. As the penultimate SO example, I point you to http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Adam Robinson Jan 30 '11 at 23:02
  • ThiefMaster so in PHP, if it's positive it's the length of the substring, and if it's negative it's the index from the end? Not very nice, I'd be surprised if it isn't index from start (positive) and index from end (when negative). In Python it's that way. The .NET Substring function does not take any index, from beginning or end, it takes the length of the substring which is why a negative value wouldn't make any sense here. – sinelaw Jan 30 '11 at 23:05
  • Adam Robinson: I wasn't recommending parsing HTML using regexes. He/she asked how to remove the <p> bracketing the string, nothing more. – sinelaw Jan 30 '11 at 23:07
  • @sinelaw: And it'll fail if either P is capitalized...or there's a space before the `/`. Or any number of other issues that make this *not* a regular pattern but an HTML pattern. – Adam Robinson Jan 30 '11 at 23:07
  • Adam Robinson: also the other methods described here will fail in those cases. With a proper regex at least it won't be hard to address many of the possible cases quite easily. I agree that if the text is supposed to be in HTML, the most robust way would be to use an HTML parser although that may be an overkill, depending on the circumstances. – sinelaw Jan 30 '11 at 23:11
  • @sinelaw: RegEx should never be used to process HTML. You should not try to use it to get around the shortcomings of rudimentary string processing rules like those given here. The OP should either use a `Substring` solution or use a parser; if you *need* regex to get around inconsistencies in formatting, then you absolutely *should not use RegEx*. – Adam Robinson Jan 30 '11 at 23:15
  • Adam Robinson: I unknowingly stepped into a meaningless holy war. Your last reply contains no arguments, only claims. As the second answer in the link you posted says, I see no reason not to use RegEx for parsing/stripping near-trivial html snippets from a known source. – sinelaw Jan 30 '11 at 23:26
  • @sinelaw: It's definitely not a meaningless holy war; you're right in that, viewed in a vacuum, HTML is just a string and RegEx is perfectly capable of processing regular patterns in strings and this case *appears* to be such a simple case. The trouble is that if it *is* this simple, then regex itself is overkill and a substring will be faster. If it *isn't* this simple, then it's *highly unlikely* that the variations will be both *substantial enough* to warrant the use of RegEx and also *not too substantial* to preclude its use. – Adam Robinson Jan 30 '11 at 23:32
  • This solution only works with perfect html, with <p> tags that have no classes, ids or other attributes, and have a closing </p> tag. That is a lot more assumptions than I would like to make about html. – KyleWpppd Jan 30 '11 at 23:32
  • 2
    KyleWpppd: I answered the question, i did not give (or intend to give) a method for general parsing of HTML or any subset thereof. – sinelaw Jan 30 '11 at 23:35
  • @sinelaw - wouldn't it be myString.Substring(3, myString.Length-4) (instead of myString.Substring(3, myString.Length-7)) . .why do you have 7 listed ?? – leora Aug 07 '11 at 15:57
  • ooo - you may be right, I don't remember the rationale at the time – sinelaw Aug 10 '11 at 00:15
0

If the data is always surrounded by <p> ... </p>:

string withoutParas = withParas.Substring(3, withParas.Length - 7);
LukeH
0

Try using the string function Remove(), passing it the IndexOf() of <p> with length 3 and the LastIndexOf() of </p> with length 4.
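
A minimal sketch of that idea, assuming String.IndexOf and String.LastIndexOf are used to locate the tags (the helper name is invented for illustration). Note that it removes the first <p> and the last </p> wherever they occur, not only at the very start and end of the string.

```csharp
using System;

static class RemoveStripper
{
    // Hypothetical helper: removes the first "<p>" and the last "</p>" in the string.
    public static string StripOuterParagraph(string html)
    {
        const string open = "<p>";
        const string close = "</p>";

        int start = html.IndexOf(open, StringComparison.Ordinal);
        int end = html.LastIndexOf(close, StringComparison.Ordinal);

        // Bail out unless both tags exist and the closing tag follows the opening one.
        if (start < 0 || end < start + open.Length) return html;

        // Remove the closing tag first so the index of the opening tag stays valid.
        return html.Remove(end, close.Length).Remove(start, open.Length);
    }

    static void Main()
    {
        Console.WriteLine(StripOuterParagraph("<p>Hello</p>")); // Hello
        Console.WriteLine(StripOuterParagraph("Hello"));        // Hello
    }
}
```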

Divi
0

If you are absolutely guaranteed that your string will always fit the pattern of <p>...</p>, then the other solutions using data.Substring(3, data.Length - 7) are sufficient (3 characters for <p> plus 4 for </p>). If, however, there's any chance that it could look at all different, then you really need to use an HTML parser. The consensus is that the HTML Agility Pack is the way to go.

Adam Robinson
  • 1
    You don't *really* need to use an HTML parser for this simple task. If all he wants to do is simply remove the <p> before and after, and the strings **always** contain that, it's as simple as a substring. However, if he wants to get more fancy, then I'd recommend an HTML parser too. His question is simple, so the answer should be too as long as that is possible, and if he needs something more, he can ask for that. – Alxandr Jan 30 '11 at 22:58
  • 1
    @Alxandr: "<P>...</P>", "<p>..< /p>" etc. Yes, you need to use an HTML parser if you're interacting with HTML. – Adam Robinson Jan 30 '11 at 23:05
  • Not if you simply want to remove the beginning and ending p-tag. It all depends on the rules of the system. If this is a system that enforces that whenever you enter data it's wrapped in <p> (for one reason or another), then removing those using an HTML parser is overkill. However, once you get to stuff like removing scripts and illegal tags I completely agree with you, but that's not what he asked for. – Alxandr Jan 30 '11 at 23:10
  • @Alxandr: Point taken; my point was that you can have valid paragraph tags that don't fit the simple pattern. I've edited my answer; thanks! – Adam Robinson Jan 30 '11 at 23:11
  • Yeah, I kinda got that, but still it's a stretch to say that you have to use HAP, though I agree that if the scenario is any more difficult than this I would use it too :) – Alxandr Jan 30 '11 at 23:22
-1
s = s.Replace("<p>", String.Empty).Replace("</p>", String.Empty);
Jonathan Wood
  • 3
    Maybe I'm wrong, but wouldn't this also replace __every__ instance of `<p>` or `</p>` including any internal ones? – jerluc Jan 30 '11 at 22:51
  • i DONT want to replace EVERY instance of <p>, just the outer one – leora Jan 30 '11 at 22:52
  • If the source is HTML, then <p> wouldn't appear within a paragraph. Unless you mean you have multiple paragraphs. That wasn't entirely clear to me. – Jonathan Wood Jan 31 '11 at 00:53