
I'm using the redcarpet gem to render some markdown text to HTML. Part of the markdown was entered by users, and one of them typed a perfectly valid special character (£). Now, when rendering it, I get: Encoding::UndefinedConversionError "\xC2" from ASCII-8BIT to UTF-8

I know it's the £ sign because if I remove it from the text to render, everything works. But users might insert other special characters as well.

I'm not sure how to deal with this. Here's my code building the HTML:

def generate_document
  temp_file_service = TempFileService.new
  path = temp_file_service.path

  template_url = TenantConfig.get('DEPOSIT_GUIDE_TEMPLATE') || DEFAULT_DOC
  template = open(template_url, 'rb', &:read)

  html = ERB.new(template).result(binding)

  File.open(path, 'w') do |f|
    f.write html
  end

  File.new(path, 'r')
end

The error is raised on the f.write line.

Here's my html.erb:

   <%= markdown(clause.text) %>

And here's the helper:

def markdown(text)
  Redcarpet::Markdown.new(Redcarpet::Render::HTML).render(text)
end

Note that the encoding problem happens only when saving the HTML to a file. Elsewhere I use the same markdown helper to render the text to the browser, and there are no problems there.

The other way around would also work for me: cleaning the markdown before saving it to the DB, replacing any special characters with the corresponding HTML entity (e.g. £ becomes &#xA3;).

I tried having a before_save callback (as suggested here: Encoding::UndefinedConversionError: "\xC2" from ASCII-8BIT to UTF-8) :

before_save :convert_text

private

def convert_text
  self.text = self.text.force_encoding("utf-8")
end

That didn't work.

I also tried (as recommended here: Using ERB in Markdown with Redcarpet):

        <%= markdown(extra_clause.text).html_safe %>

That didn't work either.

How can I fix this with either approach?

Don Giulio
  • There is no ASCII-8BIT, so I assume Ruby does not know how to translate such a byte into Unicode (simply because no ASCII variant defines that byte). "£ becomes &#xA3;": I would never use such a conversion. Long ago we sometimes used Latin-1 (often wrongly). That is so unportable and very old. Why not keep the whole stack in Unicode? – Giacomo Catenazzi Feb 16 '21 at 13:29

1 Answer


In the end I solved this by adding force_encoding("UTF-8") to the html.

Like this:

      f.write html.force_encoding("UTF-8")

That fixed it.
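
For anyone wondering why this works, here is a minimal sketch in plain Ruby with no Rails involved. It assumes the html ends up tagged as ASCII-8BIT (binary), most likely because the template is read in 'rb' mode. The sample string and the variable names are made up for illustration:

```ruby
# "£" is the two bytes 0xC2 0xA3 in UTF-8. Calling .b tags them as binary,
# mimicking content that was read in 'rb' mode.
binary_html = "<p>\xC2\xA3100</p>".b
binary_encoding = binary_html.encoding # Encoding::ASCII_8BIT

# Transcoding binary data to UTF-8 fails on the first non-ASCII byte.
conversion_error = nil
begin
  binary_html.encode("UTF-8")
rescue Encoding::UndefinedConversionError => e
  conversion_error = e
end

# force_encoding changes only the encoding label; the bytes stay the same.
utf8_html = binary_html.dup.force_encoding("UTF-8")
same_bytes = utf8_html.bytes == binary_html.bytes
```

Note that force_encoding modifies the receiver in place, which is why the sketch calls dup first. It is also only safe when the bytes really are UTF-8, so checking utf8_html.valid_encoding? before writing can catch bad input.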

Don Giulio