I need to convert HTML text into BBCode. Where can I find out how to do this? For example, this is how I convert links:

 regex = new Regex("<a href=\"(.+?)\">(.+?)</a>");
 htmlCode = regex.Replace(htmlCode, "[URL]$1[/URL]");

How can I convert all HTML tags into BBCode (and replace with an empty string any tag that has no BBCode equivalent, for example the P tag)?

Dmitriy
  • [You cannot parse HTML using regular expressions!](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – SLaks Apr 25 '10 at 16:13
  • I've read that post, I know. But I am dealing with user input, and the user will enter HTML into the program in a normal format. I define the input format myself. I hope you understand me. =) – Dmitriy Apr 25 '10 at 16:19
  • You can still work on user input using DOM manipulation libraries. And in any case, it's impossible to losslessly convert from HTML to BBCode, since the latter doesn't support everything that the former does. – Max Shawabkeh Apr 25 '10 at 16:26
  • You cannot even parse normal HTML using regular expressions. – SLaks Apr 25 '10 at 16:29
  • I can afford to lose some tags, for example: color, hr, span (but not their content), etc. – Dmitriy Apr 25 '10 at 16:35
  • You still can't. You cannot parse any nested structure using regular expressions. – SLaks Apr 25 '10 at 16:37

3 Answers

Rather than using regexes (which cannot ever ever ever parse HTML), try using HtmlAgilityPack to walk down the DOM tree and change the relevant HTML tags into BBCode. Producing a new, valid BBCode document would seem to be the hardest part of this. Maybe there is a library somewhere that helps generate valid BBCode markup?

Callum Rogers

For some HTML tags, you can just do a simple string.Replace. BBCode is in many ways just a 1:1, tag-for-tag mapping, for example <b> and </b> mapping to [B] and [/B] respectively. So that's easily accomplished with just:

html.Replace("<b>", "[b]").Replace("</b>", "[/b]")

If it's really dead-simple HTML, and you don't mind the performance impact and code ugliness of doing this tag-by-tag, go for it. But beware of cross-site scripting vulnerabilities, if you plan to display the resulting BBCode on a web page somewhere; this is nowhere near good enough for sanitization.
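To make "dead-simple" concrete, here is a small sketch of the tag-by-tag approach wrapped in a helper. The class and method names (`NaiveReplace`, `ConvertBold`) are made up for illustration. The point it shows is that `string.Replace` does an exact, case-sensitive match, so anything other than a bare lowercase `<b>` slips through.

```csharp
using System;

public static class NaiveReplace
{
    // Hypothetical helper: the plain string.Replace approach, for <b> only.
    // string.Replace(string, string) matches ordinally (case-sensitive),
    // so "<B>" or "<b class=...>" is not recognized as a bold tag.
    public static string ConvertBold(string html)
    {
        return html.Replace("<b>", "[b]").Replace("</b>", "[/b]");
    }
}
```

With this helper, `ConvertBold("<b>x</b>")` yields `[b]x[/b]`, but `ConvertBold("<B>x</B>")` comes back unchanged, and `ConvertBold("<b class=\"k\">x</b>")` converts only the closing tag. Each of those variants would need its own extra `Replace` call, which is where this approach stops scaling.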

But don't even bother trying to use regular expressions to sanitize the HTML and do automatic replacement of all tags. The <img> tag, for instance, looks completely different in HTML vs. BBCode. In HTML it's <img src="..."/> (trailing slash is optional) and in BBCode it's [IMG]...[/IMG]. Doing this with regex is... well, let's just say sub-optimal.

Regular expressions are designed for regular languages, and HTML is not a regular language; it's a context-free language. Consider using an actual HTML parser instead, like the HTML Agility Pack. Then you can descend the DOM tree, whitelist the elements you want, and map them to BBCode or anything else however you like.
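A minimal sketch of that tree walk, assuming the HtmlAgilityPack NuGet package is referenced. The class name `BbCodeConverter`, the whitelist contents, and the choice of `[URL=...]` and `[IMG]` output are illustrative assumptions, not a complete or authoritative mapping. Elements that are not on the whitelist (such as `<p>` or `<span>`) are dropped while their text is kept.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using HtmlAgilityPack; // external NuGet package, not part of the .NET BCL

public static class BbCodeConverter
{
    // Illustrative whitelist (an assumption): HTML element name -> BBCode tag.
    private static readonly Dictionary<string, string> SimpleTags =
        new Dictionary<string, string>
        {
            { "b", "B" }, { "strong", "B" },
            { "i", "I" }, { "em", "I" },
            { "u", "U" }
        };

    public static string Convert(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var output = new StringBuilder();
        AppendNode(doc.DocumentNode, output);
        return output.ToString();
    }

    private static void AppendNode(HtmlNode node, StringBuilder output)
    {
        if (node.NodeType == HtmlNodeType.Comment) return;
        if (node.NodeType == HtmlNodeType.Text)
        {
            // Turn entities such as &amp; back into plain characters.
            output.Append(HtmlEntity.DeEntitize(node.InnerText));
            return;
        }

        string name = node.Name; // HtmlAgilityPack reports names in lower case
        if (name == "script" || name == "style") return; // drop content too

        if (name == "img")
        {
            string src = node.GetAttributeValue("src", "");
            if (src.Length > 0) output.Append("[IMG]" + src + "[/IMG]");
            return;
        }

        string open = "", close = "";
        string bbTag;
        if (SimpleTags.TryGetValue(name, out bbTag))
        {
            open = "[" + bbTag + "]";
            close = "[/" + bbTag + "]";
        }
        else if (name == "a")
        {
            string href = node.GetAttributeValue("href", "");
            if (href.Length > 0) { open = "[URL=" + href + "]"; close = "[/URL]"; }
        }

        // Non-whitelisted elements emit no tag, but their children are kept.
        output.Append(open);
        foreach (HtmlNode child in node.ChildNodes) AppendNode(child, output);
        output.Append(close);
    }
}
```

Calling `BbCodeConverter.Convert(htmlCode)` returns the BBCode string. Note that this sketch copies `href` and `src` values through as-is, so it is not a sanitizer: URL schemes would still need to be checked before the result is displayed anywhere.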

Aaronaught

I know you're supposed to use a tool built for parsing the DOM, such as HtmlAgilityPack, but I needed something that uses only the tools built into .NET and doesn't have to reference an external DLL.

So I wrote a converter in C# that does it all through regular expressions.

Here's my write-up: http://www.foliotek.com/devblog/convert-html-to-bbcode-in-c/

bigamil