-1

I am in a weird scenario where I need to show content in multiple columns. I am using the CSS3 column-count property, with the jQuery Columnizer plugin as a fallback for older versions of IE. The problem is that I do not have complete control over the data, as it is served by an external web service. In most cases the content is wrapped in multiple paragraph tags:

Content#1

 <p><strong>Heading</strong><br>This is a content</p>
 <p><strong>Heading</strong><br>This is a content</p>

But in a few cases the data is not wrapped in <p> tags and looks like this:

Content#2

<strong>Day 1: xyz </strong><br>
 lorem lipsum <br> <br> 
<strong>Dag 2: lorem lipsum</strong><br> 
Morgonflyg till Arequipa i södra Peru.
<br> <br> 

The real problem is that the jQuery Columnizer plugin hangs the browser when it is asked to columnize such markup.

Now I want to transform Content#2 into Content#1 with the help of a regular expression, i.e. wrap the contents into sensible paragraphs. I hope I have made myself clear. I am using PHP.

Thank you in advance!

Dipesh KC
  • Do _not_ parse HTML with regex. Use [DOMDocument](http://php.net/manual/en/class.domdocument.php) – Alma Do Sep 26 '13 at 11:05
  • What have you tried? Also, in order to approach a task like this, you need to define the logic by which a regular expression pattern might be built. What are the rules as to what it should match? That said, REGEX is normally a poor choice when it comes to parsing mark-up. You might be better off with PHP's DOMDocument class, though if your mark-up is invalid you might struggle. – Mitya Sep 26 '13 at 11:05
  • @AlmaDoMundo please give me some hint, I'm not just trying to parse here – Dipesh KC Sep 26 '13 at 11:08
  • You'll also have to define how is a "sensible paragraph". – Passerby Sep 26 '13 at 11:11
  • Which WYSIWYG editor are you using to manage the content? The content comes from a DB, right? – Codesen Sep 26 '13 at 11:13
  • @Utkanos I just thought regular expression would get solution for this problem but I don't know how to start – Dipesh KC Sep 26 '13 at 11:13
  • @Codesen I don't have any idea about which WYSIWYG editor is being used, I'm using the data from the webservice, and I don't have the control over it. I'm just trying to be safe from my side – Dipesh KC Sep 26 '13 at 11:14
  • Is there any common pattern of the content you are getting? – Rohit Choudhary Sep 26 '13 at 11:16
  • sample content from WS – Codesen Sep 26 '13 at 11:16
  • `Dag 1: Avresa från Skandinavien till Lima
    Flyg till Lima med korti Amsterdan. Ankomst på kvällen till Lima. [Måltider på flyget]

    Dag 2: Världsarvstaden Lima och middag på piren vid Stilla havet
    Upptäcktsfärd till fots genom

    Dag 3: Arequipa och Santa Catalinaklostret
    Eftermiddagen fri för egna strövtåg i de vackra omgivningarna. [F]

    Dag 4: Genom Colcadalen till Chivay
    Dagen

    Dag 5: Längs högplatån Altiplano till Puno vid Titicacasjön.
    Dagen inleds med be`
    – Dipesh KC Sep 26 '13 at 11:23
  • @RohitKumarChoudhary In case of faulty data:heading
    content

    this pattern gets repeated
    – Dipesh KC Sep 26 '13 at 11:25
  • Tags on different lines is a perfect example of why we say **don't use regular expressions to parse HTML. Use a proper HTML parsing module.** You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration down the road. As soon as the HTML changes from your expectations, your code will be broken. See http://htmlparsing.com/php or [this SO thread](http://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php) for examples of how to properly parse HTML with PHP modules that have already been written, tested and debugged. – Andy Lester Sep 26 '13 at 12:20

2 Answers

1

Your content is not stable, and a regular expression won't do magic with content that varies like this. Since you are receiving the data from another website, there is a good chance that someday it will return a different pattern and your rules will stop working. You need a reliable source to get a reliable result.

This is crude string manipulation, but it will get you what you need as long as the pattern stays consistent. I still insist that you should use a reliable source.

$str = "<strong>Day 1: xyz </strong><br>
 lorem lipsum <br> <br>
<strong>Dag 2: lorem lipsum</strong><br>
Morgonflyg till Arequipa i södra Peru.
<br> <br> ";

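function parse($data)
{
  // Content that already starts with <p> is assumed to be well formed.
  if (substr(ltrim($data), 0, 3) == "<p>") return $data;

  // explode() drops the delimiter, so "<strong>" is added back below.
  $chunks = explode("<strong>", $data);
  $out = array();

  foreach ($chunks as $i => $chunk)
  {
    // Skip empty chunks, such as the one before the first <strong>.
    if (trim($chunk) === "") continue;

    // Every chunk except the first one originally began with <strong>.
    $item = ($i > 0 ? "<strong>" : "") . $chunk;

    // Cut the chunk at the double line break that ends the paragraph.
    // strpos() returns false when nothing is found, so compare strictly.
    $last_br = strpos($item, "<br> <br>");
    if ($last_br !== false) { $item = substr($item, 0, $last_br); }

    $out[] = "<p>" . trim($item) . "</p>";
  }

  return implode("\n", $out);
}

echo parse($str);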
MahanGM
  • So what and how do you suggest to get rid of the problem. So far, there are only two patterns of the data received from the webservice `

    headingcontent

    ` and `heading
    content

    `
    – Dipesh KC Sep 26 '13 at 11:31
  • @DipeshKc I covered that in my answer, in the part about a _reliable_ source. Anyway, you can do this with string manipulation, but it will take some time. I may be able to put something together for you; it will take a while, but I'm going to post it. – MahanGM Sep 26 '13 at 11:33
0

You can use this pattern:

/(?<!^<p>)(<strong>.*?)(<strong>.*)$/gs


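Note that the exclusion in the negative lookbehind will ONLY work if your string starts with <p>, so consider trimming it before applying the regex. Also, PHP's preg functions do not accept the g modifier (preg_replace() and preg_match_all() already match globally), so drop it and keep only s.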

<br> tags have to be removed using another regex or str_replace().
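As a rough illustration of the regex route, here is a minimal sketch that splits the string on runs of two or more <br> tags and wraps each block in a paragraph. The function name wrap_paragraphs() is made up for this example, and it assumes the only two shapes of input are the ones shown in the question; anything else may need different rules.

```php
<?php
// Hypothetical helper (not from any library): wraps loose
// "<strong>heading</strong><br>text<br><br>" runs in <p> tags.
function wrap_paragraphs(string $html): string
{
    $html = trim($html);

    // Assumption: content that already starts with a <p> tag is well formed.
    if (stripos($html, '<p>') === 0 || stripos($html, '<p ') === 0) {
        return $html;
    }

    // Split wherever two or more <br> tags follow each other
    // (optional whitespace between them, <br/> and <br /> also accepted).
    $blocks = preg_split('~(?:<br\s*/?>\s*){2,}~i', $html, -1, PREG_SPLIT_NO_EMPTY);

    $out = [];
    foreach ($blocks as $block) {
        $block = trim($block);
        if ($block !== '') {
            $out[] = '<p>' . $block . '</p>';
        }
    }

    return implode("\n", $out);
}
```

A single <br> after the heading is left in place, so each paragraph keeps the heading/content layout of Content#1.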

Also, consider using an approach other than regex to parse HTML...
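For comparison, a DOM-based version could look something like the sketch below. It uses PHP's DOMDocument, as suggested in the comments, walks the top-level nodes, and starts a new paragraph after two consecutive <br> elements. The function name wrap_with_dom() is hypothetical, the input is assumed to be UTF-8, and the grouping rule is only a guess at what a "sensible paragraph" means here.

```php
<?php
// Hypothetical helper (not from any library): groups loose inline nodes into <p> blocks.
function wrap_with_dom(string $html): string
{
    libxml_use_internal_errors(true);   // keep warnings about sloppy markup quiet
    $doc = new DOMDocument();
    // The XML prolog makes libxml read the input as UTF-8 (assumed encoding);
    // the wrapper <div> gives the fragment a single parent to iterate over.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();
    $root = $doc->getElementsByTagName('div')->item(0);

    $isBr = function (DOMNode $n): bool {
        return $n instanceof DOMElement && strtolower($n->nodeName) === 'br';
    };
    $isBlank = function (DOMNode $n): bool {
        return $n instanceof DOMText && trim($n->nodeValue) === '';
    };

    $paragraphs = [];
    $buffer = [];   // nodes of the paragraph currently being built
    $brRun = 0;     // consecutive <br> elements seen (whitespace ignored)

    $flush = function () use (&$paragraphs, &$buffer, &$brRun, $doc, $isBr, $isBlank) {
        // Trailing <br> tags and whitespace only separated paragraphs, so drop them.
        while ($buffer && ($isBr(end($buffer)) || $isBlank(end($buffer)))) {
            array_pop($buffer);
        }
        if ($buffer) {
            $inner = '';
            foreach ($buffer as $n) {
                $inner .= $doc->saveHTML($n);
            }
            $paragraphs[] = '<p>' . trim($inner) . '</p>';
        }
        $buffer = [];
        $brRun = 0;
    };

    foreach (iterator_to_array($root->childNodes) as $node) {
        if ($node instanceof DOMElement && strtolower($node->nodeName) === 'p') {
            $flush();
            $paragraphs[] = $doc->saveHTML($node);   // existing paragraphs pass through
            continue;
        }
        if ($isBlank($node)) {
            if ($buffer) { $buffer[] = $node; }      // ignore leading whitespace
            continue;                                // whitespace does not reset the <br> run
        }
        if ($isBr($node)) {
            if (!$buffer) { continue; }              // ignore leading <br> tags
            $buffer[] = $node;
            if (++$brRun >= 2) { $flush(); }         // a double <br> ends the paragraph
            continue;
        }
        $brRun = 0;
        $buffer[] = $node;
    }
    $flush();

    return implode("\n", $paragraphs);
}
```

This is longer than the regex version, but it does not depend on the exact spacing or line breaks between the tags, which is the point the commenters were making.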

Enissay