2

I'm working with PHP: fetching HTML from websites, converting it to plain text, and saving it to the database.

The text needs to be saved to the database in UTF-8. My first problem is that I don't know the original encoding. What's the best way to convert to UTF-8 from an unknown encoding?

The second issue is the HTML to plain text conversion. I tried using html2text, but it mangled all the non-ASCII (UTF-8) characters.

What is the best approach?

Edit: It seems the part about plain text was not clear enough. I don't want to just strip the HTML tags; I want to strip them while maintaining some document structure. For example, <p> and <li> tags would be converted to line breaks, and tags like <script> would be removed completely, along with their content.

applechief
  • @AntonioLaguna utf8_encode only converts strings encoded in ISO-8859-1 – applechief Dec 02 '11 at 16:08
  • Not sure exactly what you want from `text/plain` encoding (whether you want to keep the tags, strip the tags, or somewhere in between) ... it might be worth taking a look at HTML Purifier for your conversion though: http://htmlpurifier.org/ – CD001 Dec 02 '11 at 16:59
  • Related: http://stackoverflow.com/questions/1884550/converting-html-to-plain-text-in-php-for-e-mail – Herbert Dec 02 '11 at 17:08

2 Answers

3
  • Use mb_detect_encoding() for encoding detection.

  • Use strip_tags() to get rid of HTML tags.

The rest, such as how to format the output, depends on your needs.
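As a rough sketch of how the two calls above might fit together: `to_utf8()` is a hypothetical helper name, and the candidate encoding list is an assumption you would adjust to the sites you fetch. Keep in mind that `mb_detect_encoding()` only guesses; if the HTTP `Content-Type` header or a `<meta charset>` tag is available, that is usually a more reliable source.

```php
<?php
// Hypothetical helper: guess the source encoding and convert to UTF-8.
// The default candidate list is an assumption, not a universal answer.
function to_utf8(string $raw, array $candidates = ['UTF-8', 'ISO-8859-1']): string
{
    // Strict mode (third argument) rejects any candidate in which the
    // string contains invalid byte sequences.
    $from = mb_detect_encoding($raw, $candidates, true);
    if ($from === false) {
        // Nothing matched; fall back to the last candidate in the list.
        $from = end($candidates);
    }
    return $from === 'UTF-8' ? $raw : mb_convert_encoding($raw, 'UTF-8', $from);
}

$latin1 = "caf\xE9 <b>cr\xE8me</b>";    // ISO-8859-1 bytes, not valid UTF-8
$text   = strip_tags(to_utf8($latin1)); // convert first, then strip the tags
```

Converting before calling `strip_tags()` means the tag stripping always runs on UTF-8, where multi-byte characters cannot be mistaken for `<` or `>`.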

Edit: I don't know whether a complete solution exists, but this link is really helpful if you want to improve an existing HTML-to-text PHP script yourself.

http://www.phpwact.org/php/i18n/utf-8
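If you do end up writing your own converter, one possible starting point is PHP's DOM extension. This is only a sketch, not a complete solution: `html_to_text()` is a hypothetical name, the tag lists are assumptions, and it expects input that has already been converted to UTF-8.

```php
<?php
// Hypothetical helper: UTF-8 HTML to plain text, keeping rough structure.
function html_to_text(string $html): string
{
    if (trim($html) === '') {
        return '';
    }
    // Encode non-ASCII characters as numeric entities so libxml cannot
    // misread the input encoding (it does not assume UTF-8 by default).
    $ascii = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    $previous = libxml_use_internal_errors(true); // tolerate sloppy markup
    $doc = new DOMDocument();
    $doc->loadHTML($ascii);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Remove these elements together with their content.
    foreach (['script', 'style'] as $tag) {
        foreach (iterator_to_array($doc->getElementsByTagName($tag)) as $node) {
            $node->parentNode->removeChild($node);
        }
    }
    // Put a line break after each of these elements (assumed list).
    foreach (['p', 'li', 'br', 'div', 'h1', 'h2', 'h3'] as $tag) {
        foreach (iterator_to_array($doc->getElementsByTagName($tag)) as $node) {
            $node->parentNode->insertBefore($doc->createTextNode("\n"), $node->nextSibling);
        }
    }

    $root = $doc->documentElement;
    $text = $root === null ? '' : $root->textContent;
    $text = preg_replace('/[ \t]+/', ' ', $text);   // collapse runs of spaces
    $text = preg_replace('/ ?\n ?/', "\n", $text);  // trim around line breaks
    $text = preg_replace('/\n{2,}/', "\n", $text);  // drop empty lines
    return trim($text);
}

// \u{DC} = Ü, \u{EF} = ï, \u{E9} = é
$plain = html_to_text("<p>\u{DC}n\u{EF}code caf\u{E9}</p><script>alert(1)</script><ul><li>one</li><li>two</li></ul>");
```

The node lists are copied with `iterator_to_array()` before any changes because a `DOMNodeList` is live, and modifying the tree while iterating over it can skip nodes.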

Emir Akaydın
  • mb_detect_encoding seems to be what I'm looking for, but strip_tags is not quite it. I need a more advanced library like html2text that is UTF-8 friendly. – applechief Dec 02 '11 at 16:07
  • @chaft: html2text is for conversion and _formatting_ text. If it is utf8 friendly, then it shouldn't mess up the characters. Check [this link](http://www.rdeeson.com/weblog/61/using-multi-byte-character-sets-in-php-unicode-utf-8-etc.html) which states "[`strip_tags()`] may be multi-byte safe if you use UTF-8 only (multi-byte UTF-8 characters contain no byte sequences that resemble less-than or greater-than symbols). Avoid UTF-16 & UTF-32, among others." – Herbert Dec 02 '11 at 16:21
  • @EmirAkaydın: I'd +1 your answer again if I could. :) – Herbert Dec 02 '11 at 16:33
  • @Herbert html2text is not UTF-8 friendly. strip_tags() is not what I am looking for. It strips tags indiscriminately and might wreak havoc in a text with HTML tags. + with tags like – applechief Dec 02 '11 at 16:41
1

This function may be useful to you:

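<?php
// Return $x as UTF-8. Input that is not valid UTF-8 is assumed to be
// ISO-8859-1, the only source encoding utf8_encode() ever handled.
function FixEncoding($x){
  // mb_check_encoding() validates the bytes; mb_detect_encoding() only guesses.
  if(mb_check_encoding($x, 'UTF-8')){
    return $x;
  }else{
    // utf8_encode() is deprecated as of PHP 8.2; this call is equivalent.
    return mb_convert_encoding($x, 'UTF-8', 'ISO-8859-1');
  }
}
?>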
Balaji Kandasamy