
I'm using (and stuck with) the following version of Ruby:

ruby 1.8.7 (2012-06-29 patchlevel 370) [x86_64-linux]

I tried a lot of Googling, but I can't find a working answer to my problem. I'm importing a CSV file that will usually come from the user's Microsoft Excel spreadsheet. I'm having no trouble with the CSV part but I can't figure out how to handle MS "smart" quotes. My input file for my test is in DOS format and contains this line:

Jeanne O�Neill

There's an MS curly apostrophe between the O and N in O'Neill, which shows in my text editor as the "question mark diamond". When I try the following code, the curly apostrophe gets dropped:

# replace Microsoft Office 'smart' quotes

# gem to detect character encoding
require 'rchardet'
# Iconv ships with the Ruby 1.8.7 standard library
require 'iconv'
if name != nil
  cd = CharDet.detect(name)
  encoding = cd['encoding']
  name = Iconv.conv('UTF-8//TRANSLIT', encoding, name)
end

This yields the undesirable output:

Jeanne ONeill

Is there a way to write a regular expression in Ruby 1.8.7 that will detect the curly MS characters and replace them with straight ones? I've tried using hex codes in my regexes, but I can't make them work. I'm aware that Ruby 1.8.7 is much more limited in handling character encodings than 1.9, but I'm stuck with it. Upgrading Ruby isn't possible right now in this project.

Any help would be appreciated. Thank you.

After reading the post suggested by TinMan, I tried using gsub to replace the resulting '�' sub-string:

if name != nil
  name = Iconv.conv("UTF-8", "cp1252//TRANSLIT", name)
  name.gsub(/\u00ef\u00bf\u00bd/u, "'")
end

Alas, no love. It still yields the same result :(
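For what it's worth, two things may explain why the attempt above has no effect: `String#gsub` returns a new string rather than changing `name`, and after a CP1252-to-UTF-8 conversion the curly apostrophe should be U+2019 (bytes E2 80 99), not the replacement character. Here is a minimal sketch, assuming the string has already been converted to UTF-8. The names `UTF8_SMART_QUOTES` and `straighten_utf8_quotes` are hypothetical, made up for illustration:

```ruby
# Hypothetical lookup table: UTF-8 byte sequences of the curly quotes
# (U+2018, U+2019, U+201C, U+201D) paired with straight ASCII quotes.
UTF8_SMART_QUOTES = [
  ["\xE2\x80\x98", "'"],  # left single quotation mark
  ["\xE2\x80\x99", "'"],  # right single quotation mark / apostrophe
  ["\xE2\x80\x9C", '"'],  # left double quotation mark
  ["\xE2\x80\x9D", '"']   # right double quotation mark
]

# Hypothetical helper: returns a copy of text with curly quotes straightened.
# Assumes text is already UTF-8 (for example, the output of Iconv.conv).
def straighten_utf8_quotes(text)
  return text if text.nil?
  result = text
  UTF8_SMART_QUOTES.each do |curly, straight|
    # String patterns are matched literally, so no regex escapes are needed.
    result = result.gsub(curly, straight)
  end
  result
end
```

It would be used as `name = straighten_utf8_quotes(name)`; the assignment matters because the helper, like `gsub`, does not modify its argument.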

Steven Hirlston
  • I get a ? diamond on OSX too. That's one weird character. – Linuxios Dec 28 '12 at 17:26
  • Why don't you show us the code you wrote to use hex-codes in your regexes? Fixing your code is a lot better than us writing entirely new, unrelated code, and you having to rewrite it again to incorporate it. – the Tin Man Dec 28 '12 at 17:36
  • possible duplicate of [Can I use iconv to convert multi-byte smart quotes to extended ASCII smart quotes?](http://stackoverflow.com/questions/6087309/can-i-use-iconv-to-convert-multi-byte-smart-quotes-to-extended-ascii-smart-quote) – the Tin Man Dec 28 '12 at 17:37
  • I would have posted the regex in question to show that I've done my due diligence but I zapped it from github since it didn't work. A single regex (or a tip to make one) to replace curly apostrophes from my `name` variable would be welcome and would require little or no rework. The related post looks like it would work, but it doesn't. From the command line, `iconv -f "cp1252//translit" -t "utf-8" test/fixtures/files/NAME_field_examples.csv` turns "Jeanne O�Neill" into "Jeanne O�Neill" – Steven Hirlston Dec 28 '12 at 20:14

1 Answer


I did this in PHP and it worked perfectly. Maybe you can try the Ruby equivalent if it exists?

$text = str_replace('�', '"', $text);

To account for apostrophes and escaping for MySQL, I had to update my code to this...

$bad_symbols = array('�t', '�s', '�ll', '�ve', '�d', '�re', '� ', ' �');
$replacements_for_bad_symbols = array("\'t", "\'s", "\'ll", "\'ve", "\'d", "\'re", '" ', ' "');
$text = str_replace($bad_symbols, $replacements_for_bad_symbols, $text);
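A rough Ruby counterpart to the `str_replace` idea might look like the sketch below. It assumes the input is still raw Windows-1252 bytes, where the curly quotes are the single bytes 0x91 to 0x94, and it works on byte values so it does not depend on Ruby 1.9 encoding support. The names `CP1252_SMART_QUOTE_BYTES` and `straighten_cp1252_quotes` are hypothetical:

```ruby
# Hypothetical lookup table: Windows-1252 smart-quote bytes mapped to
# the byte values of straight ASCII quotes.
CP1252_SMART_QUOTE_BYTES = {
  0x91 => 0x27,  # left single quote   -> '
  0x92 => 0x27,  # right single quote  -> '
  0x93 => 0x22,  # left double quote   -> "
  0x94 => 0x22   # right double quote  -> "
}

# Hypothetical helper: returns a copy of raw with smart-quote bytes replaced.
# Assumes raw holds Windows-1252 bytes that have NOT yet been converted to UTF-8.
def straighten_cp1252_quotes(raw)
  return raw if raw.nil?
  bytes = raw.unpack('C*')  # one unsigned integer per byte
  bytes.map { |byte| CP1252_SMART_QUOTE_BYTES[byte] || byte }.pack('C*')
end

straighten_cp1252_quotes("Jeanne O\x92Neill")  # => "Jeanne O'Neill"
```

Running this before any Iconv conversion leaves only ASCII where the quotes were. Other non-ASCII bytes pass through untouched and would still need converting.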
TenTen71