I have been building a function that reads in the title text found on a webpage between the <title></title> tags. I am using the following regex code to grab the title text from the HTML page:

 if(preg_match('#<title>([^<]+)</title>#simU', $this->html, $m1))
      $this->title = trim($m1[1]);

I am using the following to escape the value for the MySQL INSERT statement:

mysql_real_escape_string(rawurldecode($this->title))

So that leaves me with a database full of titles that contain HTML entities (&nbsp; etc.) and foreign characters, such as in: Dating S.o.s | Gluten-free, Dairy-free, Sugar-free Recipes And Lifestyle Tips

The goal is to decode, remove, or clean the titles so that they look as close to perfect English as possible.

I have constructed a function that uses the following two regexes to remove HTML entities and limit junk, respectively. While not ideal (because it removes the HTML entities rather than preserving them), it's the closest to clean that I've got.

$string = preg_replace("/&#?[a-z0-9]+;/i","",$string);
//remove all non-normal chars
$string = preg_replace('/[^a-zA-Z0-9-\s\'\!\,\|\(\)\.\*\&\#\/\:]/', '', $string);

But the non-English chars still exist.

Would anyone be able to offer help as to:

  1. What is the best way to save these title strings to the db while preserving the English intent (punctuation, apostrophes, etc.)?
  2. How can I convert or eliminate the strange chars shown in my example title above?

Thanks much for your help!

Andy Lester
user603424
  • Please use the following link, it will make your life much much easier http://stackoverflow.com/questions/3577641/best-methods-to-parse-html – RobertPitt Feb 11 '11 at 19:44

2 Answers

For point 1, PHP has an html_entity_decode() function that you can use to turn HTML entities into "regular" characters.
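
As a minimal sketch of how that could look before the title is stored: the helper name `clean_title()` and the sample string are hypothetical, not from the question, and the code assumes the scraped title is valid UTF-8 and that PHP 7 or later is available. The flags and charset are passed explicitly so the result does not depend on the defaults of the PHP version in use.

```php
<?php
// Hypothetical helper (not from the original post): decode entities once,
// before the title is written to the database.
function clean_title(string $raw): string
{
    // ENT_QUOTES also decodes single-quote entities such as &#039;.
    // Passing 'UTF-8' explicitly makes the output encoding predictable.
    $decoded = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // &nbsp; decodes to U+00A0 (no-break space); turn it into a plain space.
    $decoded = str_replace("\u{00A0}", ' ', $decoded);

    // Collapse runs of whitespace; fall back to the decoded string if the
    // input turns out not to be valid UTF-8 (preg_replace returns null then).
    $collapsed = preg_replace('/\s+/u', ' ', $decoded) ?? $decoded;

    return trim($collapsed);
}

// Sample input is made up for illustration.
echo clean_title('Tips &amp; Tricks&nbsp;| Freedom&#039;s Ramparts'), "\n";
// prints: Tips & Tricks | Freedom's Ramparts
```

The decoded string still needs the usual escaping (or a prepared statement) when it goes into the INSERT.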

CanSpice
  • Given my process, would you store the entities in mysql as currently done and use the decode when displaying the value to the user? – user603424 Feb 11 '11 at 19:39
  • If you're always going to be displaying the decoded text, you might as well only do the decode once when you store it in the database. That way you don't have to remember to call `html_entity_decode()` every time you want to display some text, which will save a little bit of computational time. – CanSpice Feb 11 '11 at 19:42
  • The mbstring functions listed below were very helpful in getting rid of some foreign chars, but I am still having trouble with the HTML entities. For example, the following: `$var = 'Freedom's Ramparts | Enjoying life in Wellington, NZ'; $var = html_entity_decode($var); echo $var.'<br>';` still returns a string with the HTML code in it rather than the parentheses. – user603424 Feb 12 '11 at 17:18
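
One possible explanation for the behaviour in the last comment, offered as a guess: the entity in the sample title did not survive rendering here, so the sketch below assumes the apostrophe was written as `&#039;`. With `ENT_COMPAT`, which was the default `flags` value of `html_entity_decode()` before PHP 8.1, single-quote entities are left untouched; `ENT_QUOTES` decodes them.

```php
<?php
// Assumed input: the title from the comment, with the apostrophe written
// as the numeric entity &#039; (an assumption, the original was lost).
$var = 'Freedom&#039;s Ramparts | Enjoying life in Wellington, NZ';

// ENT_COMPAT decodes double-quote entities but leaves single-quote ones alone.
$compat = html_entity_decode($var, ENT_COMPAT, 'UTF-8');

// ENT_QUOTES decodes both double-quote and single-quote entities.
$quotes = html_entity_decode($var, ENT_QUOTES, 'UTF-8');

echo $compat, "\n"; // entity is still present
echo $quotes, "\n"; // apostrophe is decoded
```

If the entity in the real data is something else, the same comparison is a quick way to check whether the flags are the cause.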