What is the best way to remove uncommon Unicode characters? I'm building an app that uses LaTeX to generate PDF files. LaTeX does not cope well with arbitrary Unicode: it supports the common UTF-8 characters found in most languages, but it breaks if the string contains some really unusual characters. Since the content comes from a web scraper, those characters are probably there by mistake.
string.unicode_normalize(:nfkc)
does not really help when the character itself is the problem, for example U+F000:
2.3.1 :081 > "".unicode_normalize(:nfkc)
=> ""
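As far as I understand, this is expected: NFKC rewrites compatibility characters into canonical forms, but a private-use codepoint like U+F000 has no decomposition, so normalization passes it through unchanged. A minimal check of that assumption:

```ruby
# U+F000 is in the Private Use Area; NFKC has no mapping for it,
# so the character survives normalization untouched.
s = "\u{F000}"
puts s.unicode_normalize(:nfkc) == s  # normalization leaves it as-is
```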
I found something here about unpacking a string into its ord
values and packing it back, so I convert all chars to their integer codepoints and replace any codepoint above 4096 with 63 ('?').
out.unpack('U*').map{|x| x<4096 ? x : 63}.pack('U*')
This works, but I'm wondering if there is a smarter way.
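One alternative I've considered is filtering by Unicode general category instead of a raw codepoint cutoff, since a cutoff at 4096 also throws away legitimate scripts (CJK, emoji, etc.). A sketch of that idea, assuming the junk characters fall into the private-use (Co) and unassigned (Cn) categories; the method name is just for illustration:

```ruby
# Replace private-use and unassigned codepoints with "?" after
# NFKC normalization, instead of cutting off at an arbitrary value.
# \p{Co} = private use (e.g. U+F000), \p{Cn} = unassigned.
def scrub_weird_chars(str)
  str.unicode_normalize(:nfkc).gsub(/[\p{Co}\p{Cn}]/, '?')
end

puts scrub_weird_chars("caf\u{F000}e")  # the private-use char becomes "?"
puts scrub_weird_chars("日本語 ok")      # legitimate non-ASCII survives
```

This keeps real multilingual text intact while still catching the scraper noise, but it relies on the assumption that the bad characters are actually in those categories.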