
What is the best way to remove uncommon Unicode characters? I'm building an app that uses LaTeX to generate PDF files. LaTeX does not handle arbitrary Unicode well: it supports the common UTF-8 characters found in most languages, but if the string contains some really weird character, compilation breaks. Since the content comes from a web scraper, those characters are probably there by mistake.

`String#unicode_normalize(:nfkc)` does not really help when the character is genuinely weird, for example U+F000 (a Private Use Area codepoint with no standard glyph):

2.3.1 :081 > "\uF000".unicode_normalize(:nfkc)
 => "\uF000" 

I found a suggestion about unpacking and packing a string to its ord values, so I convert all chars to their integer codepoints and replace any codepoint of 4096 or above with 63 (the codepoint of ?).

# Keep codepoints below 4096; map everything else to 63 ('?')
out.unpack('U*').map { |x| x < 4096 ? x : 63 }.pack('U*')

This works, but I'm wondering if there is a smarter way.
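
For comparison, the same cutoff can be written as a single substitution with a codepoint range in the character class, which skips the unpack/pack round trip (a sketch; `out` is the scraped string from above):

    # Replace every character at or above U+1000 (4096) with '?'
    out.gsub(/[^\u0000-\u0FFF]/, '?')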

Levara
  • Have you looked at these solutions? https://stackoverflow.com/questions/1268289/how-to-get-rid-of-non-ascii-characters-in-ruby. – zakariah1 Jul 28 '22 at 04:21
  • Yeah, not really applicable, as I don't want to convert to ASCII. I want to keep characters with accents and similar; I just want to get rid of the random Cyrillic letters that the crawler (probably mistakenly) picked up. – Levara Jul 29 '22 at 08:11
  • Here is what I went with in the end: https://tex.stackexchange.com/a/652355/41953 Since I'm working with LaTeX, I asked the LaTeX version of the question on the TeX board. TL;DR: I extracted all the characters that LaTeX does know how to print, discarded all invalid and unprintable Unicode chars, and then used the `unpack.pack` code above to drop all the chars that LaTeX doesn't know how to print (the sketch after these comments shows the shape of that pipeline). So if anyone stumbles upon this question, my answer is on the TeX board; I don't see the point in duplicating it here. – Levara Jul 29 '22 at 09:10
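
A minimal sketch of that pipeline, for anyone who wants its shape without leaving this page. The whitelist contents and the `latex_safe` name are hypothetical; a real `LATEX_PRINTABLE` set would be extracted from the LaTeX setup as described in the linked tex.stackexchange answer:

    require 'set'

    # Hypothetical whitelist: ASCII printables plus Latin-1 letters,
    # purely for illustration. Build the real set as in the linked answer.
    LATEX_PRINTABLE = Set.new((0x20..0x7E).to_a + (0xC0..0xFF).to_a)

    def latex_safe(str)
      str.scrub('?')                 # drop invalid byte sequences first
         .unicode_normalize(:nfkc)   # fold compatibility forms
         .unpack('U*')               # string -> codepoints
         .map { |cp| LATEX_PRINTABLE.include?(cp) ? cp : 63 }  # 63 == '?'
         .pack('U*')                 # codepoints -> string
    end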

0 Answers