
My computer has no idea what this character is. It came from Excel.

In Excel it was a weird space; now it is literally represented by several symbols, i.e. my computer has no idea what it is.

Excel shows this character as Ê when the file is opened as CSV (as XLS it is a space of some kind). OS X's TextEdit treats it as a big space this long "            ", which is, I think, what it is. Ruby's CSV parser blows up when it tries to parse the file as normal UTF-8, so I have to add :encoding => "windows-1251:utf-8" to parse it, in which case Ruby turns the character into a "K". This K appears in groups of 9, 12, 15 and 18 (KKKKKKKKK, etc.) in my CSV and cannot be removed via gsub(/K/); groups of K (/KKKKKKKKK/, etc.) cannot be removed either! I've also used the open-source tool CSVfix, but its "remove leading and trailing spaces" command had no effect on the Ks.
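For reference, the three symptoms above (Ê, a wide space, and K) are consistent with one byte, 0xCA, being read under three different single-byte encodings. The sketch below assumes that byte and uses a hypothetical helper, decode_byte, to show each interpretation. It is an illustration, not a diagnosis of the actual file.

```ruby
# Decode one raw byte under a named legacy encoding and return it as UTF-8.
# decode_byte is a hypothetical helper written for this illustration.
def decode_byte(byte, encoding_name)
  byte.dup.force_encoding(encoding_name).encode("UTF-8")
end

# Assumption: the mystery character is stored as the single byte 0xCA.
byte = "\xCA".b

%w[ISO-8859-1 Windows-1251 macRoman].each do |name|
  char = decode_byte(byte, name)
  # Print the encoding name and the Unicode code point it maps the byte to.
  puts format("%-12s U+%04X", name, char.ord)
end
```

Under these assumptions the byte maps to U+00CA (Ê) in ISO-8859-1, U+041A (Cyrillic К) in Windows-1251, and U+00A0 (no-break space) in Mac Roman, which would explain why gsub(/K/) with a Latin K never matches.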

I've tried using sed as suggested in Remove non-ascii characters from csv, but got errors like

sed: 1: "output.csv": invalid command code o

when running something like sed -i 's/[\d128-\d255]//' input.csv on Mac.

boulder_ruby

4 Answers


**self-answers (different account, same person)

1st solution attempt:

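# Nine Cyrillic К (U+041A), which is what the cell contains after a windows-1251 decode
evil_string_from_csv_cell = "\u041A" * 9
encoding_opts = {
  :invalid => :replace, :undef => :replace,
  :replace => '', :universal_newline => true }
# Characters with no ASCII equivalent are replaced with the empty string
evil_string_from_csv_cell.encode(Encoding.find('ASCII'), **encoding_opts)
#=> ""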

2nd solution attempt:

Don't use 'windows-1251:utf-8' for encoding; use 'iso-8859-1' instead, which will turn those (Cyrillic) K's into '\xCA', which can then be removed with

string.gsub!(/\xCA/, '')

** I have not solved this problem yet.
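One variation on this attempt is to remove the byte before any decoding happens. The sketch below assumes the stray character is the single byte 0xCA and that the rest of the data is plain ASCII (in a UTF-8 file the same byte can be part of a longer sequence, so this would corrupt it). strip_byte is a hypothetical helper, not something from the original attempts.

```ruby
# Remove every occurrence of one byte from raw data without decoding it.
# raw.b returns a binary (ASCII-8BIT) copy, so the search is byte-for-byte.
def strip_byte(raw, byte = "\xCA".b)
  raw.b.gsub(byte, "")
end

# Hypothetical in-memory sample; for a real file, File.binread(path)
# returns the same kind of binary string.
cleaned = strip_byte("MetatagA\xCA\xCA\xCA,foo\n")
puts cleaned
```

Working on bytes sidesteps the regex encoding errors, at the cost of assuming the byte never occurs as legitimate data.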

3rd solution attempt:

Trying to match an array of K's as if they were actual Latin K's is foolish. Copy and paste in the actual Cyrillic K and see how that works. Here is the character; notice the little curl on the end:

К

Ruby displays it a little bit bolder than normal K's.

4th solution/strategy attempt (success):

  • Use regular expressions to capture the characters. So long as you can encode the weird spaces (or whatever they are) into something, you can then ignore them using regular expressions.
  • Also try to take advantage of any spatial (matrix-like) patterns among the document types.
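A minimal sketch of the regular-expression approach, assuming the file was decoded as windows-1251 so that the stray characters arrive as Cyrillic К (U+041A). The constant and method names are hypothetical, and the pattern would also remove any legitimate К in the data.

```ruby
# Matches one or more consecutive Cyrillic К (U+041A), not the Latin K.
CYRILLIC_KA_RUN = /\u041A+/

# Return a copy of the text with every run of Cyrillic К removed.
def strip_ka_runs(text)
  text.gsub(CYRILLIC_KA_RUN, "")
end

puts strip_ka_runs("\u041A" * 9 + "MetatagA")
```

Writing the character as a \u escape avoids confusing it with the visually similar Latin K in the source file.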
boulder_ruby

Parse your csv with the following to remove your "evil" character

.encode!("ISO-8859-1", :invalid => :replace)
peter
  • It didn't work. I actually have some other people looking at this issue because it has literally reduced my life by 12 days – boulder_ruby Oct 18 '12 at 05:54
  • I'm used to being powerful in Ruby, so this is a very stressful issue. BTW, again, I've been told by my partners not to look at this issue any more, that they will handle it, but the error had something to do with being converted from UTF-8 or something, "invalid bit error" or something like that, so some sort of binary incompatibility issue – boulder_ruby Oct 18 '12 at 06:00
  • I understand your grief. BTW, did you save your .rb file itself in UTF-8? And just out of curiosity, could you provide a link to a small CSV file giving the problem? I have some experience with these problems – peter Oct 18 '12 at 12:17

The answer to this problem is

A.) This is a very difficult problem. No one so far knows how to "physically" remove the Cyrillic Ks.

but

B.) CSV files are just strings separated by unescaped commas, so matching strings using regular expressions works just fine so long as the encoding doesn't break the program.

So to read the file

path = File.join(Rails.root, 'lib', 'assets', 'repo', name)
# Decode the bytes as windows-1251 and transcode to UTF-8 so the parser does not raise
parsed = CSV.parse(File.read(path, :encoding => "windows-1251:utf-8"))

then find specific rows by matching literal strings with a regular expression (the match simply skips over the Cyrillic K's)

parsed.each_with_index do |row, i|   # here, row[0] is the metatag column
  @specific_metatag_row = i if row[0] =~ /MetatagA/
end
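The row lookup can be tried without Rails or the original file. The sketch below uses a made-up two-row CSV string whose first column is padded with Cyrillic К, and a hypothetical helper find_row_index. Unlike the loop above, it returns the first matching row rather than the last.

```ruby
require "csv"

# Hypothetical sample data: first-column cells padded with Cyrillic К (U+041A).
SAMPLE_CSV = "\u041A\u041A\u041AMetatagA,foo\nMetatagB\u041A\u041A\u041A,bar\n"

# Return the index of the first row whose first cell matches the pattern,
# or nil when no row matches. The unanchored pattern ignores the padding.
def find_row_index(rows, pattern)
  rows.index { |row| row[0] =~ pattern }
end

rows = CSV.parse(SAMPLE_CSV)
puts find_row_index(rows, /MetatagB/).inspect
```

Because the pattern is not anchored, the padding characters on either side of the metatag text do not prevent the match.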
boulder_ruby

I couldn't get sed working but finally had luck doing this in Vim:

vim myhorriblefile.csv

# Once vim is open:
:%s/Ê/ /g
:wq

# Done!

As a generalized function for reuse, this can be:

clean_weird_character () {
  vim "$1" -c ":%s/Ê/ /g" -c "wq"
}
jdotjdot