- You can try using mb_convert_encoding() and mb_detect_encoding() instead.
- When importing those documents, you should really have a declared content encoding to go on. If you're indexing from the web, look at the Content-Type header and at the charset declared inside the HTML file itself. Always use this as your primary source. You can fall back to detection if nothing is declared, but detecting is really just guessing.
- If those two options don't help, I'd suggest writing your own code to detect invalid characters in the stream. Then just replace these and use iconv().
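As a rough sketch of the suggestions above (not a definitive implementation): trust the declared charset first, guess only when nothing is declared, and replace invalid bytes before converting. The helper names `charsetFromContentType`, `toUtf8` and `scrubInvalid` are made up for this example, the candidate list passed to mb_detect_encoding() is an assumption, and mb_scrub() needs PHP 7.2 or later.

```php
<?php
// Hypothetical helper: pull the charset out of a Content-Type header value,
// e.g. "text/html; charset=ISO-8859-1". Returns null when none is declared.
function charsetFromContentType(string $contentType): ?string
{
    if (preg_match('/charset\s*=\s*"?([A-Za-z0-9._:-]+)"?/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }
    return null;
}

// Hypothetical helper: convert a document to UTF-8, using the declared
// charset when there is one and only guessing when there is not.
function toUtf8(string $body, ?string $contentType = null): string
{
    $charset = $contentType !== null ? charsetFromContentType($contentType) : null;

    if ($charset === null) {
        // Detection is a guess. The candidate list here is an assumption;
        // adjust it to the encodings you actually expect to see.
        $detected = mb_detect_encoding($body, ['UTF-8', 'ISO-8859-1'], true);
        $charset = $detected !== false ? $detected : 'UTF-8';
    }

    // Note: a charset name that mbstring does not know raises an error,
    // so real code should validate the declared name first.
    return mb_convert_encoding($body, 'UTF-8', $charset);
}

// Hypothetical helper: replace byte sequences that are invalid in $encoding
// with mbstring's substitute character, leaving valid input untouched.
function scrubInvalid(string $body, string $encoding): string
{
    return mb_check_encoding($body, $encoding) ? $body : mb_scrub($body, $encoding);
}
```

With the invalid bytes replaced up front, something like `iconv('UTF-8', 'ISO-8859-1//TRANSLIT', scrubInvalid($body, 'UTF-8'))` no longer trips over malformed input, although it can still fail on characters the target encoding cannot represent.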
The reason iconv doesn't continue after an error is pretty simple: in some character encodings it's important that every byte is read correctly, because a character may be made up of multiple bytes. UTF-8 compensates for this by giving lead bytes and continuation bytes distinct bit patterns, so a decoder can tell where each character starts and ends, but not all encodings have this. In such an encoding a single wrong or missing byte means the rest of the string could be garbled, which isn't what you want. (I'm not entirely sure, but you should be able to replicate this by taking a UTF-16 string and removing the fifth byte in the file.)
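To make the UTF-8 part concrete, here is a small sketch of how those bit patterns let a reader find character boundaries again after damage. Continuation bytes always look like 10xxxxxx, so any byte that does not match that pattern starts a new character. The helper names `isContinuationByte` and `characterStarts` are made up for this example.

```php
<?php
// Hypothetical helper: true when the byte is a UTF-8 continuation byte,
// i.e. its top two bits are 10 (binary 10xxxxxx).
function isContinuationByte(int $byte): bool
{
    return ($byte & 0xC0) === 0x80;
}

// Hypothetical helper: byte offsets at which a character starts,
// found by skipping every continuation byte.
function characterStarts(string $bytes): array
{
    $starts = [];
    for ($i = 0, $n = strlen($bytes); $i < $n; $i++) {
        if (!isContinuationByte(ord($bytes[$i]))) {
            $starts[] = $i;
        }
    }
    return $starts;
}

// "a", then U+00E9 encoded as C3 A9, then "b": characters start at 0, 1 and 3.
$intact = characterStarts("a\xC3\xA9b");

// Same string with the lead byte C3 lost: the stray A9 is recognisable as a
// continuation byte, so only that one character is damaged and "b" is still
// found at its own boundary (offset 2).
$damaged = characterStarts("a\xA9b");
```

A fixed-width 2-byte layout has no such marker, so after a lost byte there is nothing to tell a decoder that it is now reading from the middle of a character.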
Hey, I'll even illustrate the issue :-) Below is a (sort of) UTF-16 example which uses 2 bytes per character.
[74 00] [65 00] [73 00] [74 00] = test
Now let's remove a single byte, here the first 0x00:
[74 65] [00 73] [00 74] [00] = ....
I have no idea what it would actually become, but as you can see it simply breaks up the rest of the string the moment one byte is missing. If you're lucky you'd be indexing in Chinese.
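If you want to try the illustration yourself, the following sketch reproduces it with mbstring. It assumes the little-endian flavour (UTF-16LE) to match the byte order shown above, and it cuts off the dangling last byte before decoding so that only complete 2-byte units are read; both choices are assumptions made for the demonstration.

```php
<?php
// "test" as UTF-16LE: 74 00 65 00 73 00 74 00
$utf16 = mb_convert_encoding('test', 'UTF-16LE', 'UTF-8');

// Remove the second byte (the first 0x00), as in the illustration.
$broken = substr($utf16, 0, 1) . substr($utf16, 2);

echo bin2hex($utf16), "\n";   // 7400650073007400
echo bin2hex($broken), "\n";  // 74650073007400

// Keep only whole 2-byte units: [74 65] [00 73] [00 74]
$wholeUnits = substr($broken, 0, intdiv(strlen($broken), 2) * 2);

// Decode the misaligned units. They now read as U+6574, U+7300 and U+7400,
// three code points from the CJK range instead of "test".
$decoded = mb_convert_encoding($wholeUnits, 'UTF-8', 'UTF-16LE');
echo $decoded, "\n";
```

So in this particular case you really would end up indexing in Chinese: every unit after the gap pairs a byte from one character with a byte from the next.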