1

I am working on a Rails 3.1 app with Ruby 1.9.3 and Mongoid as my ORM, and I am facing an annoying issue. I would like to truncate the content of a post like this:

<%= raw truncate(strip_tags(post.content), :length => 200) %>

I am using raw and strip_tags because my post.content is actually handled with a rich text editor.

I have a serious issue with non-ASCII characters. Imagine my post content is the following:

éééé éééé éééé éééé éééé éééé éééé éééé

The naive approach above produces this:

éééé éééé éééé éééé éééé &eac... 

It looks like truncate sees each word of the string as &eacute;&eacute;&eacute;&eacute;.

Is there a way to either:

  1. Have truncate handle actual UTF-8 strings, where 'é' counts as a single character? That would be my preferred approach.
  2. Hack the above instruction so that the result is better, for example by forcing Rails to truncate between two words?

I am asking because I have not found any solution so far. This is the only place in my app where I have problems with such characters, and it is a major issue since the whole content of the website is in French, so it contains a lot of é, ç, à, ù.

Also, I think this behavior is quite unfortunate for the truncate helper, because in my case it does not truncate at 200 characters at all, but at approximately 25 characters!
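To make the arithmetic concrete: if the stored text holds the entity `&eacute;` rather than the character é, each visible letter costs 8 characters of the length budget, so 200 characters cover only about 25 letters. The plain-Ruby sketch below assumes the content really is entity-encoded. The `decode_entities` helper and its tiny lookup table are hypothetical names for illustration, not Rails APIs; a real application would need a complete entity decoder.

```ruby
# Hypothetical helper: replaces a handful of named HTML entities with their
# UTF-8 characters and leaves anything it does not know untouched.
def decode_entities(text)
  table = {
    'eacute' => "\u00E9", # é
    'agrave' => "\u00E0", # à
    'ugrave' => "\u00F9", # ù
    'ccedil' => "\u00E7"  # ç
  }
  text.gsub(/&([a-z]+);/) { table.fetch($1, $&) }
end

encoded = '&eacute;' * 4            # what the helper appears to be counting
decoded = decode_entities(encoded)  # "éééé"

puts encoded.length  # 32
puts decoded.length  # 4
```

Truncating the decoded string instead of the encoded one would make the length limit count visible characters.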

etolpygo
rpechayr
  • Does `post.content` use UTF-8 é or the HTML `&eacute;` entity? – mu is too short Feb 09 '12 at 08:43
  • Good question. How can I check that? I would bet on `&eacute;`. I am aware that this could be a symptom of a much bigger problem; I just wanted to focus my question on something specific and simple. So maybe there is a way to handle the whole thing by making sure Mongo stores UTF-8 strings, but I don't know how to check that. – rpechayr Feb 09 '12 at 08:47
  • @muistooshort I updated my question to better reflect the situation. I am working with a stripped HTML string. Does that help? – rpechayr Feb 09 '12 at 08:51

4 Answers

3

Probably too late to help with your issue, but... You can use the ActiveSupport::Multibyte::Chars limit method, like so:

post.content.mb_chars.limit(200).to_s

See http://api.rubyonrails.org/v3.1.1/classes/ActiveSupport/Multibyte/Chars.html#method-i-limit

I was having a very similar issue (truncating strings in different languages) and this worked for my case. This is after making sure the encoding is set to UTF-8 everywhere: Rails config, database config and/or database table definitions, and any HTML templates.
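One caveat worth checking against the linked documentation: `limit` is described there as limiting the byte size of the string without breaking characters, so `limit(200)` keeps at most 200 bytes, not 200 characters. For accented French text in UTF-8, where é takes 2 bytes, that can mean noticeably fewer characters than expected. The sketch below shows the difference using only core `String` methods. The helper names `truncate_chars` and `truncate_bytes` are made up for illustration and are not part of ActiveSupport.

```ruby
# Keep at most `max` characters (String#[] counts characters in Ruby 1.9+).
def truncate_chars(text, max)
  text[0, max]
end

# Keep at most `max` bytes without splitting a multibyte character.
def truncate_bytes(text, max)
  cut = text.byteslice(0, max)
  # Drop trailing bytes until the result is valid in its encoding again.
  cut = cut.byteslice(0, cut.bytesize - 1) until cut.valid_encoding?
  cut
end

text = "\u00E9" * 10                  # ten "é", 20 bytes in UTF-8

puts truncate_chars(text, 5).length   # 5
puts truncate_bytes(text, 5).length   # 2
```

If the goal is a limit on visible characters, the character-based variant is the one to reach for; the byte-based one fits storage limits.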

etolpygo
1

If your string is HTML then I would suggest you check out the truncate_html gem. I've not used it with characters like this but it should be aware of where it can safely truncate the string.

edovino
Nick
0

I've written `strings` to help truncate, align, and wrap multibyte text, with support for languages that do not use whitespace (Japanese, Chinese, etc.).

require 'strings'

# Truncate to a width of 12, appending an ellipsis
Strings.truncate('ラドクリフ、マラソン五輪代表に1万m出場にも含み', 12)
# => "ラドクリフ…"
Piotr Murach
0

There is a simple way, although it is not a nice solution. First, make sure the content you save is UTF-8. This might not be necessary.

content = "éééé"
post.content = content.force_encoding('utf-8') unless content.encoding.to_s == "UTF-8"

Then, when you read it, you can force it back to UTF-8:

<%= raw truncate(strip_tags(post.content.force_encoding('utf-8')), :length => 200) %>
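Note that `force_encoding` only changes the encoding label attached to the bytes. It neither transcodes them nor decodes HTML entities, so it helps only when the stored bytes are already UTF-8 but tagged as something else. A small plain-Ruby sketch of that behavior (the variable names are illustrative):

```ruby
utf8   = "\u00E9" * 4                               # "éééé", tagged UTF-8
binary = utf8.dup.force_encoding(Encoding::BINARY)  # same bytes, tagged as raw bytes

puts utf8.length    # 4
puts binary.length  # 8

# Relabelling the same bytes as UTF-8 restores the character count.
relabelled = binary.dup.force_encoding(Encoding::UTF_8)
puts relabelled.length           # 4
puts relabelled.valid_encoding?  # true

# An HTML entity is plain ASCII text, so relabelling does not shorten it.
entity = '&eacute;'.dup.force_encoding(Encoding::UTF_8)
puts entity.length  # 8
```

If the content is stored as entities, as the comments on the question suggest it might be, relabelling alone would leave the truncation result unchanged.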
twooface