2

I need to match all new line characters outside of a particular html tag or pseudotag.

Here is an example. I want to match all "\n"s ouside of [code] [/code] tags (in order to replace them with <br> tags) in this text fragment:

These concepts are represented by simple Python classes.  
Edit the polls/models.py file so it looks like this: 

[code]  
from django.db import models

class Question(models.Model):
    question_text = models.CharField(max_length=200)
    pub_date = models.DateTimeField('date published') 
[/code]

I know that I should use negative lookaheads, but I'm struggling to figure the whole thing out.

Specifically, I need a PCRE expression, I will use it with PHP and perhaps Python.

zx81
  • 41,100
  • 9
  • 89
  • 105
Mikhail Batcer
  • 1,938
  • 7
  • 37
  • 57

2 Answers2

2

To me, this situation seems to be straight out of Match (or replace) a pattern except in situations s1, s2, s3 etc. Please visit that link for full discussion of the solution.

I will give you answers for both PHP and Python (since the example mentioned django).

PHP

(?s)\[code\].*?\[/code\](*SKIP)(*F)|\n

The left side of the alternation matches complete [code]...[/code] tags, then deliberately fails, and skips the part of the string that was just matched. The right side matches newlines, and we know they are the right newlines because they were not matched by the expression on the left.

This PHP program shows how to use the regex (see the results at the bottom of the online demo):

<?php
$regex = '~(?s)\[code\].*?\[/code\](*SKIP)(*F)|\n~';

$subject = "These concepts are represented by simple Python classes.
Edit the polls/models.py file so it looks like this:

[code]
from django.db import models

class Question(models.Model):
question_text = models.CharField(max_length=200)
pub_date = models.DateTimeField('date published')
[/code]";

$replaced = preg_replace($regex,"<br />",$subject);
echo $replaced."<br />\n";
?>

Python

For Python, here's our simple regex:

(?s)\[code\].*?\[/code\]|(\n)

The left side of the alternation matches complete [code]...[/code] tags. We will ignore these matches. The right side matches and captures newlines to Group 1, and we know they are the right newlines because they were not matched by the expression on the left.

This Python program shows how to use the regex (see the results at the bottom of the online demo):

import re
subject = """These concepts are represented by simple Python classes.  
Edit the polls/models.py file so it looks like this: 

[code]  
from django.db import models

class Question(models.Model):
    question_text = models.CharField(max_length=200)
    pub_date = models.DateTimeField('date published') 
[/code]"""

regex = re.compile(r'(?s)\[code\].*?\[/code\]|(\n)')
def myreplacement(m):
    if m.group(1):
        return "<br />"
    else:
        return m.group(0)
replaced = regex.sub(myreplacement, subject)
print(replaced)
Community
  • 1
  • 1
zx81
  • 41,100
  • 9
  • 89
  • 105
  • @MikhailBatcer You're very welcome, it's a nice question. :) Did you have a look at the long explanation in the linked question? I added PHP, PCRE, Python tags to your question (please remember to tag your regex questions like that as mentioned by the tag wiki). – zx81 May 28 '14 at 17:43
1

Here is another solution implemented in python, but without using regex. It also handles nested codeblocks (if necessary) and reads the file line by line, which is extremely useful if you process very large files.

input_file = open('file.txt', 'r')
output_file = open('output.txt', 'w')

    in_code = 0
    for line in input_file:
        if line.startswith('[code]'):
            if in_code == 0:
                line = '\n' + line
            in_code += 1
            output_file.write(line)
        elif line.startswith('[/code]'):
            in_code -= 1
            output_file.write(line)
        else:
            if in_code == 0:
                output_file.write(line.rstrip('\n') + '<br />')
            else:
                output_file.write(line)
miindlek
  • 3,523
  • 14
  • 25