I'm banging my head against the wall trying to work out a parser rule (regexp-based?) for the following problem. I'm developing a text markup parser similar to Textile (in PHP), but I can't get the inline formatting rules right, and I noticed that the Textile parsers I found cannot format the following text the way I want:

-*deleted* -- text- and -more deleted text-

The result I want to have is:

<del><strong>deleted</strong> -- text</del> and <del>more deleted text</del>

What I do not want is:

<del><strong>deleted</strong> </del>- text- and <del>more deleted text</del>

Any ideas are much appreciated. Thanks very much!

UPDATE

I think I should have mentioned that '-' should still be a valid character (hyphen) :) For example, the following should be possible:

-american-football player-

expected result:

<del>american-football player</del>
aurora
  • Why not use Textile or Markdown in the first place? Saves you time and trouble. – Gordon Jul 14 '10 at 07:30
  • Because the implementations I found apparently have limitations in formatting. I also dislike several formatting rules -- I need a mixture of Textile, Markdown and reStructuredText -- something that behaves 100% how I would like it to behave :-) – aurora Jul 14 '10 at 09:15
  • I think you need to employ some artificial intelligence with mind-reading capabilities ;) How do you expect to parse `-american-football-player`, for example? You need to formulate general, clear and consistent parsing rules in human language or as a collection of examples, and only then try to translate them to regexps or whatever. – Rorick Jul 14 '10 at 10:23
  • @rorick: I think my rules are quite easy for a human to formulate. Opening '-': at the beginning of a sentence and/or when there's whitespace before it. Closing '-': at the end of a sentence or when whitespace follows. I don't think you need special AI for this. However, it might be the case that my rules are too complex to achieve with a regex, but that was part of my question. – aurora Jul 14 '10 at 12:05
  • @harald - Rorick pointed out, correctly, that you're missing a few input/output examples in your question, and showed another edge case. That said, is the latest version I've posted working for you? – Kobi Jul 14 '10 at 12:11
  • @harald: So `-american-football-player` results in the same string. And what about `-american-football -player-`? Should it be `american-football -player` or `-american-football player` or another string? – Rorick Jul 14 '10 at 12:38
  • @kobi: Looks promising, I'll do some further tests ... – aurora Jul 14 '10 at 14:20
  • @rorick: Sorry, maybe I misunderstood your first comment. To your question: I would expect it to be american-football -player, because in my opinion this makes much more sense than anything else I can think of. ... – aurora Jul 14 '10 at 14:38
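
The whitespace rule described in the comments above can be written down as a small PHP sketch. This is only an illustration under stated assumptions: the function name `del_by_whitespace_rule` is hypothetical, "end of sentence" is approximated as "end of the text or followed by whitespace" (trailing punctuation is not handled), and the extra requirement that the characters next to the delimiters are not dashes is an addition, made so that ` -- ` is not mistaken for a delimiter.

```php
<?php
// Hypothetical sketch of the rule from the comments, not a tested parser.
// Opening '-': start of text or whitespace before it, and a character that
//              is neither whitespace nor '-' right after it.
// Closing '-': a character that is neither whitespace nor '-' right before
//              it, and whitespace or end of text after it.
function del_by_whitespace_rule(string $text): string
{
    $pattern = '~(?<!\S)-(?=[^-\s])(.+?)(?<=[^-\s])-(?!\S)~';
    return preg_replace($pattern, '<del>$1</del>', $text);
}

echo del_by_whitespace_rule('-*deleted* -- text- and -more deleted text-'), "\n";
echo del_by_whitespace_rule('-american-football -player-'), "\n";
echo del_by_whitespace_rule('-american-football-player'), "\n";
```

With these assumptions, the edge cases raised by Rorick come out as aurora describes: `-american-football -player-` wraps `american-football -player`, and `-american-football-player` is left unchanged because it has no closing dash.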

5 Answers

Based on the RedCloth library's parser description, with some modifications for the double dash.

@
  (?<!\S)               # Start of string, or after space or newline
  -                     # Opening dash
  (                     # Capture group 1
    (?:                 #   : (see note 1)
      [^-\s]+           #   :
      [-\s]+            #   :
    )*?                 #   :
    [^-\s]+?            #   :
  )                     # End
  -                     # Closing dash
  (?![^\s!"\#$%&',\-./:;=?\\^`|~[\]()<])  # (see note 2)
@x
  • Note 1: This should match up to the next dash lazily, while consuming any non-single dashes, and single dashes surrounded by whitespace.
  • Note 2: Followed by space, punctuation, line break or end of string.

Or compacted:

@(?<!\S)-((?:[^-\s]+[-\s]+)*?[^-\s]+?)-(?![^\s!"#$%&',\-./:;=?\\^`|~[\]()<])@

A few examples:

$regex = '@(?<!\S)-((?:[^-\s]+[-\s]+)*?[^-\s]+?)-(?![^\s!"#$%&\',\-./:;=?\\\^`|~[\]()<])@';
$replacement = '<del>\1</del>';

echo preg_replace($regex, $replacement, '-*deleted* -- text- and -more deleted text-'), "\n";
echo preg_replace($regex, $replacement, '-*deleted*--text- and -more deleted text-'), "\n";
echo preg_replace($regex, $replacement, '-american-football player-'), "\n";

Will output:

<del>*deleted* -- text</del> and <del>more deleted text</del>
<del>*deleted*</del>-text- and <del>more deleted text</del>
<del>american-football player</del>

In the second example, it will match just -*deleted*-, since there are no spaces before the --. -text- will not be matched, because the initial - is not preceded by a space.
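
The pattern above only produces the `<del>` tags. To get the nested `<strong>` from the question, one option is a second pass, sketched below. The helper name `render_inline` and the `<strong>` pattern are assumptions made for illustration (the `<strong>` pattern is deliberately naive: `*text*` with no `*` inside); only the `<del>` pattern comes from this answer.

```php
<?php
// Sketch only: render_inline() and $strongPattern are illustrative names,
// not part of the answer above.
function render_inline(string $text): string
{
    // <del> pattern copied from the answer above.
    $delPattern = '@(?<!\S)-((?:[^-\s]+[-\s]+)*?[^-\s]+?)-(?![^\s!"#$%&\',\-./:;=?\\\^`|~[\]()<])@';
    // Assumed, deliberately naive <strong> pattern: *text* with no '*' inside.
    $strongPattern = '@\*([^*]+)\*@';

    // Pass 1: strike-through. Pass 2: bold inside the already-tagged text.
    $text = preg_replace($delPattern, '<del>\1</del>', $text);
    return preg_replace($strongPattern, '<strong>\1</strong>', $text);
}

echo render_inline('-*deleted* -- text- and -more deleted text-'), "\n";
```

For the question's sample input this prints `<del><strong>deleted</strong> -- text</del> and <del>more deleted text</del>`. Running the passes one after another is the simplest arrangement, but it does not handle overlapping or mis-nested markers.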

Markus Jarderot

For a single token, you can simply match:

-((?:[^-]|--)*)-

and replace with:

<del>$1</del>

and similarly for \*((?:[^*]|\*{2,})*)\* and <strong>$1</strong>.

The regex is quite simple: a literal - at both ends. In the middle, in a capturing group, we allow anything that isn't a hyphen, or two hyphens in a row.

To also allow single dashes inside words, as in objective-c, you can accept a dash that sits between two word characters:

-((?:[^-]|--|\b-\b)*)-
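
A quick way to try this pattern in PHP is sketched below. The helper name `strike` and the `~` delimiters are choices made for this illustration; the pattern itself is the one shown above.

```php
<?php
// strike() is a hypothetical helper name used only for this illustration.
function strike(string $text): string
{
    // '-' ... '-' where the body may contain non-hyphens, '--',
    // or a single '-' sitting between two word characters.
    return preg_replace('~-((?:[^-]|--|\b-\b)*)-~', '<del>$1</del>', $text);
}

echo strike('-*deleted* -- text- and -more deleted text-'), "\n";
echo strike('-objective-c-'), "\n";
// No whitespace check on the opening dash, so hyphenated words in plain
// text can be picked up as delimiters:
echo strike('american-football and ice-hockey'), "\n";
```

The last call returns `american<del>football and ice</del>hockey`, which shows one limitation: the pattern does not check what precedes the opening dash, so two hyphenated words in ordinary text can be read as a delimiter pair.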
Kobi
  • OK, for this example it works -- but '-' should still be a valid character in the text. For example, "-objective-c-" should become "objective-c". – aurora Jul 14 '10 at 08:07
  • @harald - well, you didn't mention you need it :) – Kobi Jul 14 '10 at 08:11

The strong tag is easy:

$string = preg_replace('~[*](.+?)[*]~', '<strong>$1</strong>',  $string);

Working on the others.


Shameless hack for the del tag:

$string = preg_replace('~-(.+?)-~', '<del>$1</del>', $string);
$string = str_replace('<del></del>', '--', $string);
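
To see what these snippets do with the question's sample input, they can be wrapped in a small function. This is only a check of the code above, not a recommended approach; the name `hack_format` is hypothetical.

```php
<?php
// hack_format() is a hypothetical wrapper around the snippets above.
function hack_format(string $string): string
{
    $string = preg_replace('~[*](.+?)[*]~', '<strong>$1</strong>', $string);
    // Lazy match: stops at the very next '-', including the first '-' of '--'.
    $string = preg_replace('~-(.+?)-~', '<del>$1</del>', $string);
    return str_replace('<del></del>', '--', $string);
}

echo hack_format('-*deleted* -- text- and -more deleted text-'), "\n";
```

For that input the result is `<del><strong>deleted</strong> </del><del> text</del> and <del>more deleted text</del>`: the lazy `.+?` closes the first `<del>` at the first dash of `--`, and the `str_replace` step does not fire because `.+?` needs at least one character between the dashes.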
Alix Axel
  • That'd be `str_replace('', '--', $string);`. I guess that's the problem with hacks :) – Kobi Jul 14 '10 at 08:16
  • @Kobi: Oh! Didn't even notice that! Your solution is way better and the OP should use it. I had a very similar one but couldn't get the non-capturing group to work... I'm out of patience today - been awake for 22 hrs. :P – Alix Axel Jul 14 '10 at 08:20

You could try something like:

'/-.*?[^-]-\b/'

Where the ending hyphen must be at a word boundary and preceded by something that is not a hyphen.

Josiah

I think you should read this warning sign first: You can't parse [X]HTML with regex.

Perhaps you should try googling for a PHP HTML library.

Sjuul Janssen
  • A valid comment would be that you cannot match nested quotes, or in this case `*` and `-`, for example `- aa * bb - cc - bb * aa-`. – Kobi Jul 14 '10 at 08:07