
This question may have been asked in a different way; if so, please point me to it. I just couldn't find it among my search results.

I would like to parse text for mark-ups, like those here on SO.

  1. eg. * some string for bullet list
  2. eg. *some string* for italic text
  3. eg. &some string& for a URL
  4. eg. &some string&specific url& for URL different from string

etc.

I can think of two ways to go about processing a string to find out special mark-up sequences:

a. I could proceed sequence by sequence, i.e. parse the string looking for sequence 1, then sequence 2, etc. That however seems inefficient, as the string would have to be parsed multiple times.

b. It seems better to process the string character by character, keeping a memory of special characters and their positions. When the remembered characters match one of the special sequences above, they are replaced by HTML in the string. I'm not really sure that's a better idea, however, nor how one would implement it.
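A rough single-pass sketch of idea (b), just to make the question concrete (the marker rules here are hypothetical and only cover the simple `*...*` and `&...&` cases):

```python
# Rough single-pass sketch of idea (b): scan once, remember where a marker
# opened, and emit HTML when its pair closes. Hypothetical rules:
# *text* -> <i>text</i>, &text& -> <a href="text">text</a>.
# (Does not handle the &string&url& case or unclosed markers.)
def render(text):
    out = []
    open_marks = {}  # marker char -> index in `out` where it opened
    for ch in text:
        if ch in "*&":
            if ch in open_marks:  # closing marker: wrap the span
                start = open_marks.pop(ch)
                inner = "".join(out[start:])
                del out[start:]
                if ch == "*":
                    out.append("<i>%s</i>" % inner)
                else:
                    out.append('<a href="%s">%s</a>' % (inner, inner))
            else:  # opening marker: remember where it started
                open_marks[ch] = len(out)
        else:
            out.append(ch)
    return "".join(out)
```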

What is the best way to go about this? How about Regular Expressions? Does it follow pattern a or b? Is there a third option?

P.S. I am using Python. Python example most appreciated.

Ry-
neydroydrec
  • I think regular expressions would probably be the easiest way. Not the most efficient, but unless you're processing several-hundred-kilobyte documents, you probably shouldn't have any problems. – Ry- Apr 22 '12 at 17:40
  • Are you sure you want to do this? – PeeHaa Apr 22 '12 at 17:40
  • @minitech: I want to store documents with their markups and would like to be able to load the HTML translation without experiencing delays. Pages shouldn't get that big however (but its size will depend on the end user). – neydroydrec Apr 22 '12 at 17:46
  • @RepWhoringPeeHaa: your formulation is not helpful, what are you implying? – neydroydrec Apr 22 '12 at 17:46

2 Answers


You're essentially trying to implement a lexical analyser, or 'lexer'. Try searching for 'lexer', 'parser', and 'markup' for further reading material. [Edit: I may mean "parser" rather than "lexer"; a lexer is one part of a parser.]

Parsers are commonly implemented using regular expressions as part of the solution, but there's a bit more to it than that.

If you're doing this for Markdown specifically, are you sure you don't want to use an existing Markdown parser/lexer? There are some very fast and well-tested Markdown parsers already in existence.


Sidenote: please try not to roll your own markup syntax - there are dozens of plain-text markup languages already. Pick one you like and use it. Wikipedia formatting, Markdown, and others come to mind. There are ready-made tools for parsing these.
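To illustrate the lexer idea, here is a toy tokenizer built on a single combined regex with named groups (the token names and patterns are made up for this example, not taken from any real Markdown parser):

```python
import re

# Toy lexer: one combined regex with named groups, scanned left to right.
# Each match yields a (token_name, text) pair; a parser would then walk
# this token stream and decide what HTML to emit.
TOKEN_RE = re.compile(r"""
    (?P<ITALIC>\*[^*]+\*)      # *italic text*
  | (?P<LINK>&[^&]+&)          # &some string&
  | (?P<TEXT>[^*&]+)           # anything else, verbatim
""", re.VERBOSE)

def tokenize(source):
    for match in TOKEN_RE.finditer(source):
        yield match.lastgroup, match.group()
```

For example, `tokenize("a *b* c")` yields a TEXT token, an ITALIC token, and another TEXT token, which is exactly the kind of stream a parser stage consumes.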

Li-aung Yip
  • +1, I would check this link http://en.wikipedia.org/wiki/Lightweight_markup_language and choose the closest match. – Juha Apr 23 '12 at 08:47

Regular expressions, of course! If you still haven't done so, learn them. Once you have, you will find it hard to imagine how you got along without them. The samples you show are simple to express with regular expressions. For example, an asterisk, then a space, then a word is expressed as:

\*\s\w+

Nothing else but regular expressions.
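For the search-and-replace part, `re.sub` with backreferences does the job. A sketch for the samples in the question (the exact patterns are guesses from the examples shown; adjust to taste):

```python
import re

def to_html(text):
    # *some string* -> italics (non-greedy, so two spans on one line don't merge)
    text = re.sub(r"\*(.+?)\*", r"<i>\1</i>", text)
    # &string&url& -> link with separate text and target
    # (must run before the two-marker rule below, or it would never match)
    text = re.sub(r"&([^&]+)&([^&\s]+)&", r'<a href="\2">\1</a>', text)
    # &string& -> link whose text is also the target
    text = re.sub(r"&([^&]+)&", r'<a href="\1">\1</a>', text)
    return text
```

Note that the order of the substitutions matters: the three-marker link form has to be handled before the plain two-marker form.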

Israel Unterman
  • I have used REs, but never in a search-and-replace, and never with multiple possible sequences built around the same character. I will try your suggestion. Thanks. – neydroydrec Apr 22 '12 at 17:48