Rails truncate UTF-8 strings containing é (for example)

Question

I am working on a Rails 3.1 app with Ruby 1.9.3 and Mongoid as my ORM, and I am facing an annoying issue. I would like to truncate the content of a post like this:

<%= raw truncate(strip_tags(post.content), :length => 200) %>

I am using raw and strip_tags because my post.content is actually handled with a rich text editor.

I have a serious issue with non-ASCII characters. Imagine my post content is the following:

éééé éééé éééé éééé éééé éééé éééé éééé

The naive approach above produces this:

éééé éééé éééé éééé éééé &eac...

It looks like truncate is seeing every word of the string as &eacute;&eacute;&eacute;&eacute;.

Is there a way to either:

Have truncate handle actual UTF-8 strings, where 'é' counts as a single character? That would be my favorite approach.
Hack the above instruction so that the result is better, for example by forcing Rails to truncate between two words?

I am asking this question because I have not found any solution so far. This is the only place in my app where I have problems with such characters, and it is a major issue since the whole content of the website is in French, so it contains a lot of é, ç, à, ù.

Also, I think this behavior is quite unfortunate for the truncate helper, because in my case it does not truncate at 200 characters at all, but at approximately 25 characters!
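
A quick sanity check on that figure, assuming (this is not confirmed in the thread) that the stored text holds the named entity &eacute; rather than the literal character. The helper max_tokens is hypothetical and only does the arithmetic:

```ruby
# encoding: utf-8
# Assumption: the stored text holds the named entity "&eacute;", not the literal "é".

# Upper bound on how many copies of `token` fit in `budget` characters
# (ignores the spaces between words and the "..." omission).
def max_tokens(budget, token)
  budget / token.length
end

puts "é".length                    # 1 character
puts "&eacute;".length             # 8 characters
puts max_tokens(200, "&eacute;")   # 25
puts max_tokens(200, "é")          # 200
```

If that assumption holds, a 200-character budget can never show more than about 25 accented letters, which matches what I see.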

Good question. How can I check that? I would bet on &eacute;. I am aware that this could be a symptom of a much bigger problem; I just wanted to focus my question on something specific and simple. So maybe there is a way to handle the whole thing by making sure Mongo stores UTF-8 strings, but I don't know how to check that. - rpechayr, Feb 09 '12 at 08:47
@muistooshort I updated my question to better reflect the situation. I am working with a stripped HTML string. Does it help? - rpechayr, Feb 09 '12 at 08:51

etolpygo · Accepted Answer · 2014-10-24T17:56:04.277

Probably too late to help with your issue, but... You can use the ActiveSupport::Multibyte::Chars limit method, like so:

post.content.mb_chars.limit(200).to_s

see http://api.rubyonrails.org/v3.1.1/classes/ActiveSupport/Multibyte/Chars.html#method-i-limit

I was having a very similar issue (truncating strings in different languages) and this worked for my case. This is after making sure the encoding is set to UTF-8 everywhere: Rails config, database config and/or database table definitions, and any HTML templates.
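
One caveat worth checking against the linked documentation: limit is described there as capping the byte size of the string without breaking characters, so limit(200) on accented text may return fewer than 200 characters. The sketch below uses plain Ruby (no Rails) to illustrate the difference. limit_bytes and limit_chars are hypothetical helpers written for this example, not part of ActiveSupport, and String#scrub assumes Ruby 2.1 or later.

```ruby
# encoding: utf-8

# Hypothetical helper: keep at most max_bytes bytes without splitting a character.
def limit_bytes(text, max_bytes)
  # byteslice may cut a multibyte character in half; scrub("") drops the
  # invalid trailing bytes that such a cut leaves behind.
  text.byteslice(0, max_bytes).scrub("")
end

# Hypothetical helper: character-based cut, for comparison.
def limit_chars(text, max_chars)
  text[0, max_chars]
end

text = "é" * 300
puts text.length                     # 300 characters
puts text.bytesize                   # 600 bytes in UTF-8
puts limit_bytes(text, 200).length   # 100 characters
puts limit_chars(text, 200).length   # 200 characters
```

If the goal is a fixed number of visible characters rather than a storage limit, the character-based cut is probably the closer match.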

score 1 · Answer 2 · edited Aug 21 '16 at 17:26

If your string is HTML then I would suggest you check out the truncate_html gem. I've not used it with characters like this but it should be aware of where it can safely truncate the string.
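
If adding a gem is not an option, the two ideas from the question (count 'é' as one character, and cut between two words) can be sketched in plain Ruby. This is only an illustration under assumptions: ENTITIES is a deliberately tiny, hypothetical table covering the characters mentioned in the question, and decode_entities and truncate_on_word are made-up helper names, not Rails or gem APIs. The Rails truncate helper also accepts a :separator option (for example :separator => ' ') to cut at a natural break, which may be worth trying first.

```ruby
# encoding: utf-8

# Hypothetical, incomplete entity table; a real app would need a full decoder.
ENTITIES = {
  "&eacute;" => "é",
  "&ccedil;" => "ç",
  "&agrave;" => "à",
  "&ugrave;" => "ù",
  "&amp;"    => "&"
}.freeze

# Replace known named entities with their literal characters;
# unknown entities are left untouched.
def decode_entities(text)
  text.gsub(/&[a-z]+;/) { |entity| ENTITIES.fetch(entity, entity) }
end

# Truncate to at most `length` characters (omission included),
# backing up to the last space so that no word is split.
def truncate_on_word(text, length, omission = "...")
  return text if text.length <= length
  cut = text[0, length - omission.length]
  last_space = cut.rindex(" ")
  cut = cut[0, last_space] if last_space
  cut + omission
end

encoded = (["&eacute;" * 4] * 8).join(" ")
decoded = decode_entities(encoded)
puts decoded.length                  # 39
puts truncate_on_word(decoded, 20)   # éééé éééé éééé...
```

Decoding before truncating means the length budget is spent on visible characters, and the cut never lands inside an entity.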

edited Aug 21 '16 at 17:26 by edovino · answered Feb 09 '12 at 10:11 by Nick

Piotr Murach · Answer 3 · score 0 · 2018-12-18T21:39:16.437

I've written the strings gem to help truncate, align, and wrap multibyte text, with support for languages that don't use whitespace (Japanese, Chinese, etc.):

Strings.truncate('ラドクリフ、マラソン五輪代表に1万m出場にも含み', 12)
# => "ラドクリフ…"

answered Feb 15 '15 at 19:09 · edited Dec 18 '18 at 21:39

score 0 · Answer 4 · answered Feb 10 '12 at 14:22

There is a simple way, though it is not a nice solution. First, make sure the content you save is UTF-8. This might not be necessary.

content = "éééé"
post.content = content.force_encoding('utf-8') unless content.encoding.to_s == "UTF-8"

Then, when you read it, you can force it back:

<%= raw truncate(strip_tags(post.content.force_encoding('utf-8')), :length => 200) %>
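
For context, force_encoding only relabels the existing bytes; it does not convert them. The plain-Ruby sketch below shows what that changes and what it does not. relabel_as_utf8 is a hypothetical helper name, and the binary-tagged string stands in for content that came back from storage with the wrong encoding label (an assumption, since the thread never confirms how the content is stored).

```ruby
# encoding: utf-8

# Hypothetical helper: tag a copy of the bytes as UTF-8 without changing them.
def relabel_as_utf8(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8)
end

# Simulate UTF-8 bytes that were wrongly tagged as binary.
binary = "éééé".dup.force_encoding(Encoding::BINARY)
puts binary.length                       # 8 (counted byte by byte)

fixed = relabel_as_utf8(binary)
puts fixed.length                        # 4 (counted as characters)
puts fixed.valid_encoding?               # true

# Relabelling does not help if the text holds HTML entities.
puts relabel_as_utf8("&eacute;").length  # 8
```

So this approach can fix a wrong encoding label, but if the content actually contains &eacute; entities, the length seen by truncate stays the same.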
