Problem with regex for text parsing (similar to textile)

Question

I'm banging my head against the wall trying to figure out a (regexp-based?) parser rule for the following problem. I'm developing a text markup parser similar to Textile (in PHP), but I don't know how to get the inline formatting rules right, and I noticed that the Textile parsers I found cannot format the following text the way I want:

-*deleted* -- text- and -more deleted text-

The result I want to have is:

<del><strong>deleted</strong> -- text</del> and <del>more deleted text</del>

What I do not want is:

<del><strong>deleted</strong> </del>- text- and <del>more deleted text</del>

Any ideas are very much appreciated. Thanks!

UPDATE

I should have mentioned that '-' must remain a valid character (hyphen) :) For example, the following should be possible:

-american-football player-

Expected result:

<del>american-football player</del>

Why not use Textile or Markdown in the first place? Saves you time and trouble. — Gordon, Jul 14 '10 at 07:30
Because the implementations I found apparently have limitations in formatting. I also dislike several formatting rules -- I need a mixture of Textile, Markdown and reStructuredText -- something that behaves 100% the way I want it to :-) – aurora, Jul 14 '10 at 09:15
I think you need to employ some artificial intelligence with mind reading capabilities ;) How do you guess to parse `-american-football-player` for example? You need to formulate general, clear and consistent parsing rules in human language or in a collection of examples and only then try to translate them to regexps or whatever. — Rorick, Jul 14 '10 at 10:23
@rorick: I think my rules are easy enough for a human to state: an opening '-' is at the beginning of a sentence and/or preceded by whitespace; a closing '-' is at the end of a sentence or followed by whitespace. I don't think you need special AI for this. It may be that my rules are too complex to implement with regex, but that was part of my question. – aurora, Jul 14 '10 at 12:05
@harald - Rorick pointed out, correctly, that you're missing a few input/output examples in your question, and showed another edge case. That said, is the latest version I've posted working for you? — Kobi, Jul 14 '10 at 12:11
@harald: So `-american-football-player` results in the same string. And what about `-american-football -player-`? Should it be `<del>american-football -player</del>` or `-american-football <del>player</del>` or another string? – Rorick, Jul 14 '10 at 12:38
@rorick: Sorry, maybe I misunderstood your first comment. To your question: I would expect `<del>american-football -player</del>`, because in my opinion that makes much more sense than anything else I can think of. ... – aurora, Jul 14 '10 at 14:38

Markus Jarderot · Answer 1 · 2010-07-14T17:14:17.293

Based on the RedCloth library's parser description, with some modifications for the double dash.

@
  (?<!\S)               # Start of string, or after space or newline
  -                     # Opening dash
  (                     # Capture group 1
    (?:                 #   : (see note 1)
      [^-\s]+           #   :
      [-\s]+            #   :
    )*?                 #   :
    [^-\s]+?            #   :
  )                     # End
  -                     # Closing dash
  (?![^\s!"\#$%&',\-./:;=?\\^`|~[\]()<])  # (see note 2)
@x

Note 1: This should match up to the next dash lazily, while consuming any non-single dashes, and single dashes surrounded by whitespace.
Note 2: Followed by space, punctuation, line break or end of string.

Or compacted:

@(?<!\S)-((?:[^-\s]+[-\s]+)*?[^-\s]+?)-(?![^\s!"#$%&',\-./:;=?\\^`|~[\]()<])@

A few examples:

$regex = '@(?<!\S)-((?:[^-\s]+[-\s]+)*?[^-\s]+?)-(?![^\s!"#$%&\',\-./:;=?\\\^`|~[\]()<])@';
$replacement = '<del>\1</del>';

echo preg_replace($regex, $replacement, '-*deleted* -- text- and -more deleted text-'), "\n";
echo preg_replace($regex, $replacement, '-*deleted*--text- and -more deleted text-'), "\n";
echo preg_replace($regex, $replacement, '-american-football player-'), "\n";

Will output:

<del>*deleted* -- text</del> and <del>more deleted text</del>
<del>*deleted*</del>-text- and <del>more deleted text</del>
<del>american-football player</del>

In the second example, it will match just -*deleted*-, since there are no spaces before the --. -text- will not be matched, because the initial - is not preceded by a space.
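
The question also asks for `*deleted*` to become `<strong>deleted</strong>`. A minimal sketch of chaining both rules, assuming a simple lazy `*...*` rule for `<strong>` (that rule and the helper name `format_inline` are illustrative assumptions, not part of RedCloth or of this answer):

```php
<?php
// Hypothetical helper for illustration only.
function format_inline(string $text): string
{
    // <del> pattern copied unchanged from the answer above.
    $del = '@(?<!\S)-((?:[^-\s]+[-\s]+)*?[^-\s]+?)-(?![^\s!"#$%&\',\-./:;=?\\\^`|~[\]()<])@';
    // Assumed <strong> rule: shortest run of text between two asterisks.
    $strong = '@\*(.+?)\*@';

    // Apply the <del> rule first, then the <strong> rule on its result.
    $text = preg_replace($del, '<del>\1</del>', $text);
    return preg_replace($strong, '<strong>\1</strong>', $text);
}

echo format_inline('-*deleted* -- text- and -more deleted text-'), "\n";
// <del><strong>deleted</strong> -- text</del> and <del>more deleted text</del>
```

The `<strong>` rule here is deliberately naive; a real parser would probably want the same kind of boundary checks the `<del>` pattern uses.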

Kobi · Answer 2 · 2010-07-14T08:13:12.027

For a single token, you can simply match:

-((?:[^-]|--)*)-

and replace with:

<del>$1</del>

and similarly for \*((?:[^*]|\*{2,})*)\* and <strong>$1</strong>.

The regex is quite simple: a literal - at both ends. In the middle, in a capturing group, we allow anything that isn't a hyphen, or two hyphens in a row.

To also allow single dashes inside words, as in objective-c, this can work by accepting a dash that has a word character on each side:

-((?:[^-]|--|\b-\b)*)-
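
A short sketch of how the last pattern might be applied with `preg_replace`, run against the inputs from the question (the wrapper name `strike` is an assumption for illustration):

```php
<?php
// Hypothetical wrapper around the pattern above.
function strike(string $text): string
{
    // Between the outer dashes allow: any non-dash character, a double dash,
    // or a single dash with a word character on each side.
    return preg_replace('~-((?:[^-]|--|\b-\b)*)-~', '<del>$1</del>', $text);
}

echo strike('-*deleted* -- text- and -more deleted text-'), "\n";
// <del>*deleted* -- text</del> and <del>more deleted text</del>
echo strike('-american-football player-'), "\n";
// <del>american-football player</del>
echo strike('-objective-c-'), "\n";
// <del>objective-c</del>
```

Note that this pattern does not check what comes before the opening dash, so hyphenated words in ordinary prose can also be picked up as delimiters.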

edited Jul 14 '10 at 08:13

answered Jul 14 '10 at 07:28


OK, for this example it works, but '-' should still be a valid character in the text. For example, "-objective-c-" should become "<del>objective-c</del>". – aurora Jul 14 '10 at 08:07
@harald - well, you didn't mention you need it :) – Kobi Jul 14 '10 at 08:11

Alix Axel · Answer 3 · 2010-07-14T08:00:56.220

The strong tag is easy:

$string = preg_replace('~[*](.+?)[*]~', '<strong>$1</strong>',  $string);

Working on the others.

Shameless hack for the del tag:

$string = preg_replace('~-(.+?)-~', '<del>$1</del>', $string);
$string = str_replace('<del></del>', '--', $string);

edited Jul 14 '10 at 08:00

answered Jul 14 '10 at 07:53


That'd be `str_replace('</del><del>', '--', $string);`. I guess that's the problem with hacks :) – Kobi Jul 14 '10 at 08:16
@Kobi: Oh! Didn't even notice that! Your solution is way better and the OP should use it. I had a very similar one but couldn't get the non-capturing group to work... I'm out of patience today - been awake for 22 hrs. :P – Alix Axel Jul 14 '10 at 08:20
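
A minimal sketch of what the hack does to the question's input. Because `.+?` needs at least one character, step 1 never produces an empty `<del></del>`; the `--` ends up as a `</del><del>` pair instead, so restoring that pair is assumed here to be the intended repair:

```php
<?php
$input = '-*deleted* -- text- and -more deleted text-';

// Step 1 of the hack: every pair of dashes becomes a <del> element,
// so "--" is split into one closing dash and one opening dash.
$step1 = preg_replace('~-(.+?)-~', '<del>$1</del>', $input);
echo $step1, "\n";
// <del>*deleted* </del><del> text</del> and <del>more deleted text</del>

// Assumed repair: put "--" back where a closing tag directly meets an opening tag.
$fixed = str_replace('</del><del>', '--', $step1);
echo $fixed, "\n";
// <del>*deleted* -- text</del> and <del>more deleted text</del>
```

Even with that repair, every single hyphen still acts as a delimiter, so input such as `-american-football player-` is not handled.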

score 0 · Answer 4 · answered Jul 14 '10 at 07:29

You could try something like:

'/-.*?[^-]-\b/'

Where the ending hyphen must be at a word boundary and preceded by something that is not a hyphen.

answered Jul 14 '10 at 07:29

Josiah


score 0 · Answer 5 · edited May 23 '17 at 12:03

I think you should read this warning sign first: "You can't parse [X]HTML with regex".

Perhaps you should try googling for a PHP HTML library.

edited May 23 '17 at 12:03

Community


answered Jul 14 '10 at 07:51

Sjuul Janssen


A valid comment would be that you cannot match nested quotes, or in this case `*` and `-`, for example `- aa * bb - cc - bb * aa-`. – Kobi Jul 14 '10 at 08:07
