1

When I read a CSV file that includes Chinese characters using the csv crate, I get an error.

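extern crate csv;

use std::thread;
use std::time::Duration;

fn main() {
    // Open the file and treat the first row as data rather than as a header.
    let mut rdr =
        csv::Reader::from_file("C:\\Users\\Desktop\\test.csv").unwrap().has_headers(false);
    for record in rdr.decode() {
        // This unwrap panics when a field is not valid UTF-8.
        let (a, b): (String, String) = record.unwrap();
        println!("a:{},b:{}", a, b);
    }
    // Keep the console window open for a while.
    thread::sleep(Duration::from_secs(500));
}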

The error:

Running `target\release\rust_Work.exe`
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Decode("Could not convert bytes \'FromUtf8Error { bytes: [208, 213, 195, 251], error: Utf8Error { valid_up_to: 0 } }\' to UTF-8.")', ../src/libcore\result.rs:788
note: Run with `RUST_BACKTRACE=1` for a backtrace.
error: Process didn't exit successfully: `target\release\rust_Work.exe` (exit code: 101)

test.csv:

姓名   性别    年纪    分数     等级
小二    男     12      88      良好
小三    男     13      89      良好
小四    男     14      91      优秀

[Image: screenshot of the CSV data]

Shepmaster
songroom
  • 2
    This isn't reproducible. Please provide the exact CSV data you're using and show the full output of your program. Please also explain what you expect to happen. The CSV crate should have no problems with Chinese characters, so you've likely misdiagnosed the issue. – BurntSushi5 Feb 19 '17 at 14:08
  • 5
    Perhaps the problem is the CSV not being encoded in UTF-8. Most Rust code will *only* work with UTF-8. If the file is encoded with UTF-16, UTF-32, Big5, GBK, or anything else, that is likely the problem. – DK. Feb 19 '17 at 14:59
  • @DK. shouldn't *something* have complained that it wasn't UTF-8 though? – Shepmaster Feb 19 '17 at 15:01
  • 1
    @Shepmaster "program has a panic! bug" - that's vague enough that it *could* be a UTF-8 error... or pretty much anything else. I'm guessing, here. – DK. Feb 19 '17 at 15:03
  • @Shepmaster yes, it is so hard to realize that this is the error ¬¬_, you need super powers. Also, the answer must be the entire working program, even with an input that is space-separated values. It is better to do all the work so the OP does nothing. – freinn Feb 19 '17 at 15:06
  • 1
    @freinn *even with an input that is space-separated values*: then it wouldn't be a **comma**-separated value (CSV) file. – Shepmaster Feb 19 '17 at 15:08
  • 2
    The term CSV is frequently used even if the delimiter isn't a comma. Modifying the delimiter is supported, but it is only allowed to be a single byte: https://docs.rs/csv/0.15.0/csv/struct.Reader.html#method.delimiter (The CSV crate reads "ASCII-compatible" data.) – BurntSushi5 Feb 19 '17 at 15:26
  • I added an image of the CSV data. Thank you. – songroom Feb 20 '17 at 13:55
  • @songroom No, Shepmaster means you should copy and paste the content of `./data/simple.csv` here, and also paste the output of your program from the terminal. – kennytm Feb 21 '17 at 09:27
  • How do I copy the CSV file here? – songroom Mar 03 '17 at 13:28

3 Answers

0

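I found a way to solve it: read the file as raw bytes, decode them as GB18030 into a String, and only then hand the text to the csv crate. Thanks, all.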

extern crate csv;
extern crate rustc_serialize;
extern crate encoding;

use encoding::{Encoding, DecoderTrap};
use encoding::all::GB18030;
use std::fs::File;
use std::io::prelude::*;

fn main() {
    let path = "C:\\Users\\Desktop\\test.csv";
    let mut f = File::open(path).expect("cannot open file");

    // Read the raw bytes; the file is not UTF-8, so it cannot be read into a String directly.
    let mut bytes: Vec<u8> = Vec::new();
    f.read_to_end(&mut bytes).expect("cannot read file");

    // Decode the GB18030 bytes into a UTF-8 String, skipping invalid sequences.
    let mut chars = String::new();
    GB18030
        .decode_to(&bytes, DecoderTrap::Ignore, &mut chars)
        .expect("cannot decode file as GB18030");

    // Parse the decoded text; the first row is treated as a header.
    let mut rdr = csv::Reader::from_string(chars).has_headers(true);
    for row in rdr.decode() {
        let (x, y, r): (String, String, String) = row.unwrap();
        println!("({}, {}): {:?}", x, y, r);
    }
}

output:

[Image: screenshot of the program output]

Shepmaster
-1

I'm not sure what could be done to make the error message more clear:

Decode("Could not convert bytes 'FromUtf8Error { bytes: [208, 213, 195, 251], error: Utf8Error { valid_up_to: 0 } }' to UTF-8.")

FromUtf8Error is documented in the standard library, and the text of the error says "Could not convert bytes to UTF-8" (although there's some extra detail in the middle).

Simply put, your data isn't in UTF-8 and it must be. That's all that the Rust standard library (and thus most libraries) really deals with. You will need to figure out what encoding it is in and then find some way of converting from that to UTF-8. There may be a crate to help with either of those steps.
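To make the error message more concrete, here is a small std-only sketch. The helper name `first_invalid_byte` is my own invention for illustration, not part of any crate. It feeds the four bytes from the error message to `std::str::from_utf8` and reports where validation stops, which corresponds to the `valid_up_to` field in the message. Those bytes look like the GBK/GB18030 encoding of the first header field (姓名), though that is an inference from the data shown, not something the error itself states.

```rust
use std::str;

/// Returns `None` if `bytes` is valid UTF-8, otherwise the index of the
/// first byte that is not part of a valid UTF-8 sequence.
/// (Hypothetical helper, for illustration only.)
fn first_invalid_byte(bytes: &[u8]) -> Option<usize> {
    match str::from_utf8(bytes) {
        Ok(_) => None,
        Err(e) => Some(e.valid_up_to()),
    }
}

fn main() {
    // The bytes reported in the error message from the question.
    let from_error = [208u8, 213, 195, 251];
    println!("{:?}", first_invalid_byte(&from_error));

    // Text held in a Rust string literal is always UTF-8.
    let utf8 = "姓名".as_bytes();
    println!("{:?}", first_invalid_byte(utf8));
}
```

A check like this only tells you that the input is not UTF-8; it does not tell you which encoding it actually is.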

Perhaps even better, you can save the file as UTF-8 from the beginning. Sadly, it's relatively common for people to hit this issue when using Excel, because Excel has historically not had an easy way to export UTF-8 CSV files. Its plain CSV export writes the file in the system locale encoding.

Community
Shepmaster
-3

Part 1: Read Unicode (Chinese or not) characters:

The easiest way to achieve your goal is to use the read_to_string method, which appends the content of your file to the String you pass to it. This only works when the file is valid UTF-8; otherwise it returns an error:

use std::io::prelude::*;
use std::fs::File;

fn main() {
    let mut f = File::open("file.txt").unwrap();
    let mut buffer = String::new();

    // Returns an error (and so panics here) if the file is not valid UTF-8.
    f.read_to_string(&mut buffer).unwrap();

    println!("{}", buffer)
}
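One caveat worth illustrating: `read_to_string` checks that the data is UTF-8, so it will not help with a file in another encoding. The sketch below is std-only, and the helper names (`read_utf8`, `round_trip`) and the scratch file names are placeholders I made up. It writes bytes to a temporary file and reads them back, once with UTF-8 text and once with the four bytes from the question's error message.

```rust
use std::env;
use std::fs::{self, File};
use std::io::{self, Read, Write};
use std::path::Path;

/// Reads a whole file into a String; fails if the content is not valid UTF-8.
fn read_utf8(path: &Path) -> io::Result<String> {
    let mut buffer = String::new();
    File::open(path)?.read_to_string(&mut buffer)?;
    Ok(buffer)
}

/// Writes `bytes` to a scratch file in the temp directory, reads it back
/// with `read_utf8`, and removes the file again.
/// (Hypothetical helper; the file name is an arbitrary placeholder.)
fn round_trip(name: &str, bytes: &[u8]) -> io::Result<String> {
    let path = env::temp_dir().join(name);
    File::create(&path)?.write_all(bytes)?;
    let result = read_utf8(&path);
    let _ = fs::remove_file(&path);
    result
}

fn main() {
    // UTF-8 input reads back unchanged.
    println!("{:?}", round_trip("utf8_demo.txt", "小二,男,12".as_bytes()));

    // The bytes from the question's error message are rejected.
    match round_trip("not_utf8_demo.txt", &[208, 213, 195, 251]) {
        Ok(s) => println!("read: {}", s),
        Err(e) => println!("error kind: {:?}", e.kind()),
    }
}
```

For a file like the one in the question, the bytes would need to be decoded from their actual encoding first, for example with a crate that supports it.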

Part 2: Parse a CSV file, its delimiter being a ',':

extern crate regex;
use regex::Regex;

use std::io::prelude::*;
use std::fs::File;

fn main() {
    let mut f = File::open("file.txt").unwrap();
    let mut buffer = String::new();
    let delimiter = ",";

    f.read_to_string(&mut buffer).unwrap();

    // Treat line breaks as delimiters so the whole file becomes one flat list of fields.
    let modified_buffer = buffer.replace("\n", delimiter);

    // One field: a run of non-delimiter characters, optionally followed by a delimiter.
    let field = format!("([^{0}]+){0}?", delimiter);
    // Each match captures three consecutive fields, i.e. one three-column row.
    let regex_str = field.repeat(3);

    let re = Regex::new(&regex_str).unwrap();

    for cap in re.captures_iter(&modified_buffer) {
        let (s1, s2, dist): (String, String, usize) =
            (cap[1].to_string(), cap[2].to_string(), cap[3].parse::<usize>().unwrap());
        println!("({}, {}): {}", s1, s2, dist);
    }
}
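If a regex feels heavy for this, the same idea can be sketched with only `str::lines` and `str::split`. This is a simplified illustration, not a real CSV parser: the function name `parse_rows` is hypothetical, it assumes the text is already valid UTF-8, and it does not handle quoted fields or delimiters inside fields.

```rust
/// Splits `text` into rows and each row into trimmed fields on `delimiter`.
/// Blank lines are skipped. Quoted fields and escaped delimiters are NOT
/// supported. (Hypothetical helper, for illustration only.)
fn parse_rows(text: &str, delimiter: char) -> Vec<Vec<String>> {
    text.lines()
        .filter(|line| !line.trim().is_empty())
        .map(|line| {
            line.split(delimiter)
                .map(|field| field.trim().to_string())
                .collect()
        })
        .collect()
}

fn main() {
    // Sample rows modeled on the data in the question, comma-separated.
    let data = "小二,男,12\n小三,男,13\n";
    for row in parse_rows(data, ',') {
        println!("{:?}", row);
    }
}
```

Unlike the regex version, this keeps rows separate and does not fix the number of columns, so the caller decides how to convert each field.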

Sample input and output here

Shepmaster
freinn