I'm working with PHP, getting html from websites, converting them to plain text and saving them to the database.
They need to be saved to the database in utf-8. My first problem is that I don't know the original encoding, what's the best way to encode to utf-8 from an unknown encoding?
the 2nd issue is the html to plain text conversion. I tried using html2text but it messed up all the foreign utf characters.
What is the best approach?
Edit: It seems the part about plain text is not clear enough. What i need not to just strip the html tags. I want to strip the tags while maintaining a kind of document structure. <p>
, <li>
tags would convert to line breaks etc and tags like <script>
would be completely removed with their content.