
This question may have been asked in a different way; if so, please point me to it. I just couldn't find it among my search results.

I would like to parse text for mark-ups, like those here on SO.

  1. eg. * some string for bullet list
  2. eg. *some string* for italic text
  3. eg. &some string& for a URL
  4. eg. &some string&specific url& for URL different from string

etc.

I can think of two ways to go about processing a string to find out special mark-up sequences:

a. I could proceed sequence by sequence, i.e. parse the string looking for sequence 1, then sequence 2, etc. That however seems inefficient, as the string would have to be parsed multiple times.

b. It seems better to process the string character by character, keeping a memory of special characters and their positions. When the remembered characters match one of the special sequences above, they are replaced by HTML in the string. I'm not really sure that's a better idea, however, nor how one would implement it.
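A rough single-pass sketch of idea (b), just to make the question concrete (the marker rules here are hypothetical and only cover the simple `*...*` and `&...&` cases):

```python
# Rough single-pass sketch of idea (b): scan once, remember where a marker
# opened, and emit HTML when its pair closes. Hypothetical rules:
# *text* -> <i>text</i>, &text& -> <a href="text">text</a>.
# (Does not handle the &string&url& case or unclosed markers.)
def render(text):
    out = []
    open_marks = {}  # marker char -> index in `out` where it opened
    for ch in text:
        if ch in "*&":
            if ch in open_marks:  # closing marker: wrap the span
                start = open_marks.pop(ch)
                inner = "".join(out[start:])
                del out[start:]
                if ch == "*":
                    out.append("<i>%s</i>" % inner)
                else:
                    out.append('<a href="%s">%s</a>' % (inner, inner))
            else:  # opening marker: remember where it started
                open_marks[ch] = len(out)
        else:
            out.append(ch)
    return "".join(out)
```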

What is the best way to go about this? How about Regular Expressions? Does it follow pattern a or b? Is there a third option?

P.S. I am using Python. Python example most appreciated.

Ry-
neydroydrec
  • I think regular expressions would probably be the easiest way. Not the most efficient, but unless you're processing several-hundred-kilobyte documents, you probably shouldn't have any problems. – Ry- Apr 22 '12 at 17:40
  • Are you sure you want to do this? – PeeHaa Apr 22 '12 at 17:40
  • @minitech: I want to store documents with their markups and would like to be able to load the HTML translation without experiencing delays. Pages shouldn't get that big however (but its size will depend on the end user). – neydroydrec Apr 22 '12 at 17:46
  • @RepWhoringPeeHaa: your formulation is not helpful, what are you implying? – neydroydrec Apr 22 '12 at 17:46

2 Answers


You're essentially trying to implement a lexical analyser, or 'lexer'. Try searching for 'lexer', 'parser', and 'markup' for further reading material. [Edit: I may mean "parser" rather than "lexer"; a lexer is one part of a parser.]

Parsers are commonly implemented using regular expressions as part of the solution, but there's a bit more to it than that.

If you're doing this for Markdown specifically, are you sure you don't want to use an existing Markdown parser/lexer? There are some very fast and well-tested Markdown parsers already in existence.


Sidenote: please try not to roll your own markup syntax - there are dozens of plain-text markup languages already. Pick one you like and use it. Wikipedia formatting, Markdown, and others come to mind. There are ready-made tools for parsing these.
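To illustrate the lexer idea, here is a toy tokenizer built on a single combined regex with named groups (the token names and patterns are made up for this example, not taken from any real Markdown parser):

```python
import re

# Toy lexer: one combined regex with named groups, scanned left to right.
# Each match yields a (token_name, text) pair; a parser would then walk
# this token stream and decide what HTML to emit.
TOKEN_RE = re.compile(r"""
    (?P<ITALIC>\*[^*]+\*)      # *italic text*
  | (?P<LINK>&[^&]+&)          # &some string&
  | (?P<TEXT>[^*&]+)           # anything else, verbatim
""", re.VERBOSE)

def tokenize(source):
    for match in TOKEN_RE.finditer(source):
        yield match.lastgroup, match.group()
```

For example, `tokenize("a *b* c")` yields a TEXT token, an ITALIC token, and another TEXT token, which is exactly the kind of stream a parser stage consumes.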

Li-aung Yip
  • +1, I would check this link http://en.wikipedia.org/wiki/Lightweight_markup_language and choose the closest match. – Juha Apr 23 '12 at 08:47

Regular expressions, of course! If you still haven't done so, learn them. Once you have, you will find it hard to imagine how you got along without them. The samples you show are simple to express with regular expressions. For example, an asterisk, then a space, then a word is expressed as:

\*\s\w+

Nothing else but regular expressions.
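For the search-and-replace part, `re.sub` with backreferences does the job. A sketch for the samples in the question (the exact patterns are guesses from the examples shown; adjust to taste):

```python
import re

def to_html(text):
    # *some string* -> italics (non-greedy, so two spans on one line don't merge)
    text = re.sub(r"\*(.+?)\*", r"<i>\1</i>", text)
    # &string&url& -> link with separate text and target
    # (must run before the two-marker rule below, or it would never match)
    text = re.sub(r"&([^&]+)&([^&\s]+)&", r'<a href="\2">\1</a>', text)
    # &string& -> link whose text is also the target
    text = re.sub(r"&([^&]+)&", r'<a href="\1">\1</a>', text)
    return text
```

Note that the order of the substitutions matters: the three-marker link form has to be handled before the plain two-marker form.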

Israel Unterman
  • I have used REs, but never in a search-and-replace, and never with multiple possible sequences built around the same character. I will try your suggestion. Thanks. – neydroydrec Apr 22 '12 at 17:48