Ruby 1.8.6 - process mix of utf8 codes and regular characters

Question

I have a Rails 2.2.2 & Ruby 1.8.6 app which just encountered a weird bug. There's a page which submits a form, and one of the form input values came through in params as "L\001A\001K\0012\0013\0010\0017".

It turns out that the value was copied into the text field from a PDF. In the PDF it looks like "LAK2307", but when it gets copied into the input, "\001" is inserted between each character. "\001" is the octal escape for the byte with value 1, which is the control character U+0001 (Start of Heading), not the null char (that would be U+0000).

I can't prevent people from copying values like this into inputs and submitting them, but I'd like to clean them up before saving to our database. We already convert some fields to ASCII chars before saving, by running the following code on them:

newval = Iconv.iconv('ascii//ignore//translit', 'utf-8', oldval).first

How can I do something similar to strip or convert these characters, assuming that's the best way to handle this? In this case I guess I'd just want to convert "\001" into "", and thus convert "L\001A\001K\0012\0013\0010\0017" to "LAK2307".

Thanks, Max

EDIT - changed the name of the question to better describe the problem

EDIT2 - I think that since the problem string is a mix of normal chars and escaped chars like "\001", I need to do something like this:

newstring = ""
oldstring.split("").each do |char|
  #test if char is a utf8 string encoded like "\001" (or "\153" etc)
  if char.is_utf8?  #made up method
    newstring << char.unencoded #made up method
  else
    newstring << char
  end
end

There are a couple of pseudocode elements here, the methods "is_utf8?" and "unencoded". Can anyone fill in the blanks for these?
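
One way the loop above might be filled in is to work on bytes rather than characters and drop anything in the ASCII control range. This is only a sketch: `strip_control_bytes` is a hypothetical helper name, not something from Ruby or Rails, and it assumes that no control bytes (including tab and newline) should survive in the field.

```ruby
# Hypothetical helper (not part of Ruby or Rails): drops ASCII control
# bytes (0-31 and 127) and keeps every other byte unchanged.
# Assumption: tabs and newlines are unwanted in this field too.
def strip_control_bytes(str)
  kept = str.unpack('C*').reject { |byte| byte < 32 || byte == 127 }
  kept.pack('C*')
end

oldstring = "L\001A\001K\0012\0013\0010\0017"
newstring = strip_control_bytes(oldstring)
puts newstring
```

`String#unpack` and `Array#pack` with `'C*'` exist in Ruby 1.8.6 as well as in later versions. On Ruby 1.9 and later the packed result is tagged as binary, so a `force_encoding` call may be needed if the value can contain non-ASCII bytes.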

These may help you: http://stackoverflow.com/questions/5021636/rails-2-3-2-ruby-1-8-6-encoding-question-actioncontroller-returning-utf-8 and http://stackoverflow.com/a/4585339/2767755 — Arup Rakshit, Mar 04 '15 at 11:21
[this lib](http://ruby-doc.org/stdlib-1.8.6/libdoc/nkf/rdoc/String.html#method-i-kconv) will definitely help you out. I'm installing 1.8.6 :) — Arup Rakshit, Mar 04 '15 at 11:23
Could you show the codepoints, chars or bytes for your string? — Stefan, Mar 04 '15 at 11:27
@MaxWilliams The method's API docs say it can do this, so I thought it would help. — Arup Rakshit, Mar 04 '15 at 11:28
@Stefan this is what i get from doing `arr = [];s.each_byte{|b| arr << b};arr` => `[76, 1, 65, 1, 75, 1, 50, 1, 51, 1, 48, 1, 55]` — Max Williams, Mar 04 '15 at 11:32
Do you think that instead of trying to convert utf8 strings, I should convert to bytes and just remove all instances of `1`? I'm wondering if char 1 is just a special case after all. — Max Williams, Mar 04 '15 at 11:34
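
If byte 1 really is the only special case, the idea in the last comment could be sketched with `String#delete`, which is available in Ruby 1.8.6. The method name `remove_byte_one` is made up for illustration.

```ruby
# Hypothetical helper: removes every occurrence of the byte with value 1
# and leaves all other bytes alone.
def remove_byte_one(str)
  str.delete("\001")
end

s = "L\001A\001K\0012\0013\0010\0017"
cleaned = remove_byte_one(s)
p cleaned.unpack('C*')  # => [76, 65, 75, 50, 51, 48, 55]
```

This handles only the one byte seen in the dump above. Other control characters pasted from a PDF would pass through untouched.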

score 0 · Answer 1 · answered Mar 04 '15 at 11:35

You want encode I think.

a = "L\001A\001K\0012\0013\0010\0017"
a.encode!("ISO-8859-1", :undef => :replace, :invalid => :replace, :replace => "")
puts a # => LAK2307
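
A caveat worth checking on Ruby 1.9 or later: as far as I can tell, U+0001 has a valid ISO-8859-1 mapping (byte 0x01), so `encode` with `:undef => :replace` may leave it in place, and `puts` only looks clean because terminals usually don't render control characters. The sketch below inspects the bytes after encoding and then strips control characters explicitly. `encode_then_strip` is a hypothetical helper name.

```ruby
# Hypothetical helper for Ruby 1.9+: transcodes to ISO-8859-1 as in the
# answer above, then removes ASCII control characters explicitly.
# Returns both the transcoded and the stripped string for inspection.
def encode_then_strip(str)
  latin1 = str.encode("ISO-8859-1", :undef => :replace, :invalid => :replace, :replace => "")
  stripped = latin1.gsub(/[\x00-\x1F\x7F]/, "")
  [latin1, stripped]
end

latin1, stripped = encode_then_strip("L\001A\001K\0012\0013\0010\0017")
p latin1.unpack('C*').include?(1)  # => true, byte 1 survives the transcoding
puts stripped
```

The `gsub` step does the actual cleanup here. The transcoding only matters if the input can also contain characters outside ISO-8859-1.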

j-dexx

That sucks, `encode` only goes back as far as 1.9.3 :( – j-dexx Mar 04 '15 at 11:41
I'll see if I can find the source. – Max Williams Mar 04 '15 at 11:52