Desperately trying to remove this diabolical excel generated special character from csv in ruby

Question

My computer has no idea what this character is. It came from Excel.

In Excel it was a weird space; now it shows up as several different symbols depending on the program, i.e. my computer has no idea what it is.

This character is represented by a Ê in Excel (in the csv; as xls it is a space of some kind). OS X's TextEdit treats it as a big space this long " ", which is, I think, what it is. Ruby's CSV parser blows up when it tries to parse it using normal utf-8, and I have to add :encoding => "windows-1251:utf-8" to parse it, in which case Ruby turns it into a "K". This K appears in groups of 9, 12, 15 and 18 (KKKKKKKKK, etc.) in my CSV, and cannot be removed via gsub(/K/) (groups of K, /KKKKKKKKK/, etc., cannot be removed either)! I've also used the open-source tool CSVfix, but its "removing leading and trailing spaces" command had no effect on the Ks.

I've tried using sed as suggested in Remove non-ascii characters from csv, but got errors like

sed: 1: "output.csv": invalid command code o

when running something like sed -i 's/[\d128-\d255]//' input.csv on Mac.

I would love to have a way to get ruby to say "no"/replace-with-nothing to this character and completely ignore it from the start. — boulder_ruby, Oct 16 '12 at 01:37
I don't work with non-ascii text, but have you tried opening the text with Ruby using ASCII-8BIT and finding and replacing the evil character that way? — Andrew Grimm, Oct 18 '12 at 22:24

boulder_ruby · Answer 1 · 2012-10-20T00:56:17.890

**self-answers (different account, same person)

1st solution attempt:

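# Read as windows-1251, the cell holds Cyrillic К (U+041A), not Latin K.
evil_string_from_csv_cell = "\u041A" * 9
encoding_opts = {
  :invalid => :replace, :undef => :replace,
  :replace => '', :universal_newline => true }
# Double splat so the options are passed as keywords (needed on Ruby 3).
evil_string_from_csv_cell.encode Encoding.find('ASCII'), **encoding_opts
#=> ""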

2nd solution attempt:

Don't use 'windows-1251:utf-8' for encoding, use 'iso-8859-1' instead, which will turn those (cyrillic) K's into '\xCA', which can then be removed with

string.gsub!(/\xCA/, '')
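On Ruby 1.9 and later, a /\xCA/ literal in a UTF-8 source file tends to be rejected as an invalid multibyte escape, which may be why this attempt stalled. A minimal sketch that avoids the regexp altogether, assuming the line really was read as ISO-8859-1 (the sample string below is made up for illustration):

```ruby
# Hypothetical sample: "a", three 0xCA bytes, "b", tagged as ISO-8859-1
# the way a line read with :encoding => "iso-8859-1" would be.
line = "a\xCA\xCA\xCAb".b.force_encoding("ISO-8859-1")

# Build the character to delete in the same encoding as the line,
# so String#delete compares like with like.
evil = "\xCA".b.force_encoding("ISO-8859-1")

cleaned = line.delete(evil)

# Transcode to UTF-8 once the stray character is gone.
cleaned_utf8 = cleaned.encode("UTF-8")
puts cleaned_utf8
```

String#delete is used here instead of gsub! only because it needs no regexp; the original string is left unmodified.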

** I have not solved this problem yet.

3rd solution attempt:

Trying to match a run of K's as if they were actual Latin K's is foolish. Copy and paste in the actual Cyrillic K and see how that works. Here is the character; notice the little curl on the end:

К

ruby treats it by making it a little bit bolder than normal K's

4th solution/strategy attempt (success):

Use regular expressions to capture the characters. So long as you can encode the weird spaces (or whatever they are) into something, you can then ignore them using regular expressions.
Also try to take advantage of any spatial (matrix-like) patterns amongst the document types.

The official name of this character is "U+041A" – boulder_ruby Oct 16 '12 at 02:59
another species (when using 'iso-8859-1'): "\xCA" – boulder_ruby Oct 16 '12 at 03:20
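
The two spellings in these comments are what one and the same byte, 0xCA, decodes to under the two encodings tried so far. A small sketch of that, with Mac Roman added purely as a guess at what Excel on OS X wrote (nothing in the thread confirms the file's real encoding):

```ruby
# The raw byte reported above as "\xCA", with no encoding attached.
byte = "\xCA".b

# Reinterpret the same byte under a given encoding and name its code point.
decode = lambda do |encoding_name|
  char = byte.dup.force_encoding(encoding_name).encode("UTF-8")
  format("U+%04X", char.ord)
end

cp1251   = decode.call("Windows-1251") # the encoding used in the question
latin1   = decode.call("ISO-8859-1")   # the encoding from the 2nd attempt
macroman = decode.call("macRoman")     # assumption: not confirmed in the thread

puts "Windows-1251: #{cp1251}"
puts "ISO-8859-1:   #{latin1}"
puts "macRoman:     #{macroman}"
```

If the Mac Roman guess is right, the character is a no-break space, which would fit the "big space" TextEdit shows.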

score 0 · Answer 2 · answered Oct 16 '12 at 10:02

Parse your csv with the following to remove your "evil" character

.encode!("ISO-8859-1", :invalid => :replace)
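
A related option, if the data has already been read in and tagged as UTF-8, is to drop whatever bytes are invalid there. This is only a sketch: it assumes Ruby 2.1 or later for String#scrub (newer than this thread), and the sample string is made up.

```ruby
# Hypothetical sample: a cell padded with three 0xCA bytes, tagged as UTF-8.
cell = "value\xCA\xCA\xCA".b.force_encoding("UTF-8")

# Each lone 0xCA is an invalid UTF-8 sequence here, so scrub can strip it.
scrubbed = cell.scrub("")

puts scrubbed
puts scrubbed.valid_encoding?
```

This only removes bytes that are invalid as UTF-8. A 0xCA that happens to be followed by a byte in the 0x80 to 0xBF range forms a valid two-byte character and would survive.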

peter

It didn't work. I actually have some other people looking at this issue because it has literally reduced my life by 12 days – boulder_ruby Oct 18 '12 at 05:54
I'm used to being powerful in ruby, this is a very stressful issue. Btw, again, I've been told by my partners to not look at this issue any more, that they will handle it, but the error had something to do with being converted from utf-8 or something, "invalid bit error" or something like that, so some sort of binary incompatibility issue – boulder_ruby Oct 18 '12 at 06:00
I understand your grief. Btw, did you save your .rb file itself in UTF-8? And just out of curiosity, could you provide a link to a small csv file giving the problem? I have some experience with these problems – peter Oct 18 '12 at 12:17

score 0 · Accepted Answer · answered Nov 02 '12 at 23:32

The answer to this problem is

A.) This is a very difficult problem. No one so far knows how to "physically" remove the Cyrillic Ks.

but

B.) csv files are just strings separated by unescaped commas, so matching strings using regular expressions works just fine so long as the encoding doesn't break the program.

So to read the file

f = File.open(File.join(Rails.root, 'lib', 'assets', 'repo', name), :encoding => "windows-1251:utf-8")
parsed = CSV.parse(f)

then find specific rows via regular expression literal string matching (it will overlook the cyrillic K's)

parsed.each_with_index do |p, i|   # here, p[0] is the metatag column
  @specific_metatag_row = i if p[0] =~ /MetatagA/   # remember the matching row's index
end
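
For the original wish of ignoring the character "from the start", one possible sketch is to delete the byte before the CSV parser ever sees it. This assumes the offending byte is 0xCA (the value reported for iso-8859-1 above) and that the rest of the file is plain ASCII. The sample data is made up.

```ruby
require "csv"

# Hypothetical stand-in for the file's raw bytes: the second cell of the
# first row is padded with three 0xCA bytes.
raw = "MetatagA,value\xCA\xCA\xCA\nMetatagB,other\n".b

# Delete the byte while the string is still binary, then label the rest.
# Assumption: everything that remains is plain ASCII, so UTF-8 is safe.
cleaned = raw.delete("\xCA".b).force_encoding("UTF-8")

rows = CSV.parse(cleaned)

# Same row lookup as above, now against data without the stray bytes.
row_index = rows.index { |row| row[0] =~ /MetatagA/ }
puts rows.inspect
puts row_index
```

With a real file, `raw` would come from `File.binread(path)` instead of a literal. If the file contains other non-ASCII characters, transcoding from its real encoding would be needed instead of simply relabelling it as UTF-8.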

jdotjdot · Answer 4 · 2016-04-25T15:23:44.677

I couldn't get sed working but finally had luck doing this in Vim:

vim myhorriblefile.csv

# Once vim is open:
:%s/Ê/ /g
:wq

# Done!

As a generalized function for reuse, this can be:

clean_weird_character () {
  vim "$1" -c ":%s/Ê/ /g" -c "wq"
}

answered Feb 01 '16 at 19:39