PCRE - All Unicode Spaces and Newlines

Question

I have the following piece of HTML:

      <td>
          <p><span><a href="http://www.someurl.com"><b>
              <span>W Bangkok</span></b></a> <br>
      ‎              106 North Sathorn Road ,Silom, Bangrak‎‎<br>
                    Bangkok, 10500 Thailand‎<br>
                    Phone: (66)(2) 344 4000 Fax: (66)(2) 344 4111<o:p></o:p></span></p>
     </td>

I want to replace every space, newline, and invisible character (basically everything but letters) with a single space. I also want to strip

&nbsp;, <br /> and <br>

The regex and function I wrote are:

function clean_data($str)
{
    return trim(preg_replace('/(\p{Zs}|\s|\R|\p{Zl}|\p{Z}|\p{Zp})++/u', ' ', $str));
}

However, in the example above the HTML line breaks seem to cause trouble. This is the output I get:

W Bangkok â€Ž106 North Sathorn Road ,Silom, Bangrakâ€Žâ€Ž Bangkok, 10500 Thailandâ€Ž Phone: (66)(2) 344 4000 Fax: (66)(2) 344 4111

How can I write a better regular expression to match all those

<br /> and <br>

and everything else which might be a space or newline?

The file is saved as UTF-8; when I save it as ASCII I get ? instead of â€Ž

It seems that your code does not process the data as UTF-8. The string “â€Ž” is what you get when you interpret the UTF-8 coded form of U+200E LEFT-TO-RIGHT MARK, i.e. 0xE2 0x80 0x8E, as windows-1252 encoded data. — Jukka K. Korpela, Nov 06 '14 at 09:02
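
The byte-level explanation in the comment above can be reproduced with a short sketch. It assumes PHP with the mbstring extension; the variable names are illustrative only.

```php
<?php
// U+200E LEFT-TO-RIGHT MARK written as its three raw UTF-8 bytes.
$lrm = "\xE2\x80\x8E";

// Deliberately misread those bytes as Windows-1252 and re-encode the
// resulting three characters as UTF-8. This mimics what a non-UTF-8-aware
// step in the pipeline would do.
$misread = mb_convert_encoding($lrm, 'UTF-8', 'Windows-1252');

echo bin2hex($lrm), "\n"; // e2808e
echo $misread, "\n";      // â€Ž
```

Each of the three bytes becomes a visible character of its own (0xE2 is â, 0x80 is €, 0x8E is Ž), which matches the debris in the output shown in the question.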
How can I make my code process the data as UTF-8? I'm using fread() to read the entire page into a string variable and then the PHP DOM loadHTML() method. — toni rmc, Nov 06 '14 at 16:41
See http://stackoverflow.com/questions/279170/utf-8-all-the-way-through — Jukka K. Korpela, Nov 06 '14 at 16:46
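
One commonly used workaround for the loadHTML() part, sketched here under assumptions: DOMDocument::loadHTML() does not assume UTF-8 when the markup declares no encoding, so the sketch prepends a meta declaration before parsing. The helper name load_html_utf8 is hypothetical, and the input is assumed to already be valid UTF-8 (for example, read with fread() from a UTF-8 page).

```php
<?php
// Hypothetical helper: parse an HTML string as UTF-8 with DOMDocument.
function load_html_utf8(string $html): DOMDocument
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    // Assumption: the page carries no usable charset declaration of its own,
    // so one is prepended to tell the parser the bytes are UTF-8.
    $doc->loadHTML(
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html
    );

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

// Usage: the text content keeps U+200E as a single character
// instead of turning it into several mis-decoded ones.
$doc  = load_html_utf8("<p>Bangrak\u{200E} Bangkok</p>");
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
echo bin2hex($text), "\n";
```

Whether this is enough depends on where the mis-decoding actually happens; if the string is re-encoded elsewhere (output headers, database connection), those steps need UTF-8 as well.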
Thanks man, this helped me with another issue, storing UTF-8 data in a database. It looks like I can't influence the data I receive, so I improved the regex to this: '/[^\p{L}\p{N}\p{Nd}\p{Nl}\p{No}\p{P}\p{S}\p{M}]++/u'. When I look at the text in Notepad I see left-to-right marks in it, and they cause the problems. — toni rmc, Nov 06 '14 at 17:52
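
For illustration, a minimal sketch of one way to handle the cases raised in the question and comments: <br> tags, &nbsp;, left-to-right marks, and runs of whitespace. It is not the asker's final code. The function name clean_text is hypothetical, and the input is assumed to be valid UTF-8, since preg_replace() with the /u modifier returns null on malformed input.

```php
<?php
// Hypothetical helper: normalize an HTML text fragment to single-spaced text.
function clean_text(string $str): string
{
    // 1. Replace <br>, <br/> and <br /> (any letter case) with a space.
    $str = preg_replace('~<br\s*/?>~iu', ' ', $str);

    // 2. Decode entities, so &nbsp; becomes a real U+00A0 character.
    $str = html_entity_decode($str, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // 3. Delete invisible format characters. General category Cf
    //    includes U+200E LEFT-TO-RIGHT MARK.
    $str = preg_replace('/\p{Cf}+/u', '', $str);

    // 4. Collapse whitespace and every Unicode separator (\p{Z} covers
    //    Zs, Zl and Zp, including U+00A0) into one space.
    $str = preg_replace('/[\s\p{Z}]+/u', ' ', $str);

    return trim($str);
}

$raw = "W Bangkok <br>\n \u{200E}  106 North Sathorn Road\u{200E}\u{200E}<br />Bangkok,&nbsp;10500";
echo clean_text($raw), "\n"; // W Bangkok 106 North Sathorn Road Bangkok, 10500
```

Format characters are deleted rather than replaced with a space so that a mark in the middle of a word does not split it. The sketch only removes <br> tags; stripping other markup would need strip_tags() or a DOM parse first.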
