I have a CSV file in the following format:

18.0   8   307.0      130.0      3504.      12.0   70  1        "chevrolet chevelle malibu"
15.0   8   350.0      165.0      3693.      11.5   70  1        "buick skylark 320"
18.0   8   318.0      150.0      3436.      11.0   70  1        "plymouth satellite"
16.0   8   304.0      150.0      3433.      12.0   70  1        "amc rebel sst"
17.0   8   302.0      140.0      3449.      10.5   70  1        "ford torino"

I can read the file with pandas as follows:

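import pandas as pd

# Path to the downloaded data file (same location as in the Rust example below)
DATA_PATH = "data/landing/auto-mpg.data"

column_names = [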
    'MPG', 'Cylinders', 'Displacement', 'Horsepower',
    'Weight', 'Acceleration', 'Model Year', 'Origin'
]

df = pd.read_csv(
    DATA_PATH,
    names=column_names,
    na_values="?",
    comment='\t',
    sep=" ",
    skipinitialspace=True
)

Now I am trying to read the same dataset in DataFusion as follows:

use datafusion::prelude::*;

fn get_csv_option<'a>() -> CsvReadOptions<'a> {
    let mut csv_opt = CsvReadOptions::new();
    csv_opt.has_header = false;
    csv_opt.delimiter = b' ';
    csv_opt
}

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let read_option = get_csv_option();
    let ctx = SessionContext::new();
    let df = ctx.read_csv("data/landing/auto-mpg.data", read_option).await?;
    println!("{}", df.schema());
    df.show().await?;
    Ok(())
}

This produces an empty schema and no rows. The full output is:

(mpg-car-pipeline-U1cqCC4U-py3.9) datapsycho@dataops  ~/.../mpg-car-pipeline $ cargo run
   Compiling mpg-car-pipeline v0.1.0 (/home/datapsycho/RustProjects/mpg-car-pipeline)
    Finished dev [unoptimized + debuginfo] target(s) in 10.22s
     Running `target/debug/mpg-car-pipeline`
fields:[], metadata:{}
++
++

How can I read the data in DataFusion with added column names and a schema? I have looked at the API docs, but there are not enough examples for the CsvReadOptions struct. The data file can be downloaded with the following command:

wget http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data
DataPsycho
  • Your file is not a CSV file; you're lucky pandas reads it with your specified parameters. To work with DataFusion you probably have to write a custom parser. – cafce25 Nov 14 '22 at 13:04
  • Yes, you are correct, it is not a CSV file. From what I found online, it is a file format from Analysis Studio. I will try some other approach. – DataPsycho Nov 16 '22 at 08:05
  • Is it tab delimited? – Metehan Yıldırım Feb 20 '23 at 07:38
  • Oh, no, it is not tab delimited. The file was probably generated by SAS. Somehow pandas can read it, but I should probably close this Q&A; it cannot be read with DataFusion at the moment. A CSV version of the data is already available on Kaggle. – DataPsycho Feb 20 '23 at 09:41
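
Following the custom-parser suggestion in the comments, here is a minimal sketch in Rust (standard library only) of a pre-processing step that turns each whitespace-aligned line into a real comma-separated line, which a standard CSV reader should then be able to accept. The function names `split_fields` and `to_csv_line` are hypothetical helpers, not part of DataFusion. The sketch assumes that fields are separated by runs of spaces or tabs, that the car name is the only double-quoted field, and that quotes are never escaped inside it. Missing values written as `?` are passed through unchanged.

```rust
/// Split one line of the whitespace-aligned file into fields.
/// Runs of spaces or tabs separate fields; text inside double
/// quotes is kept together as a single field (quotes removed).
fn split_fields(line: &str) -> Vec<String> {
    let mut fields = Vec::new();
    let mut chars = line.chars().peekable();

    while let Some(&c) = chars.peek() {
        if c.is_whitespace() {
            // Skip separator characters between fields.
            chars.next();
        } else if c == '"' {
            // Quoted field: consume the opening quote, then read
            // up to the closing quote (or the end of the line).
            chars.next();
            let mut field = String::new();
            for ch in chars.by_ref() {
                if ch == '"' {
                    break;
                }
                field.push(ch);
            }
            fields.push(field);
        } else {
            // Unquoted field: read until the next whitespace.
            let mut field = String::new();
            while let Some(&ch) = chars.peek() {
                if ch.is_whitespace() {
                    break;
                }
                field.push(ch);
                chars.next();
            }
            fields.push(field);
        }
    }
    fields
}

/// Join fields into one CSV line, quoting a field only when it
/// contains a comma, a double quote, or a newline.
fn to_csv_line(fields: &[String]) -> String {
    fields
        .iter()
        .map(|f| {
            if f.contains(',') || f.contains('"') || f.contains('\n') {
                // Escape embedded quotes by doubling them.
                format!("\"{}\"", f.replace('"', "\"\""))
            } else {
                f.clone()
            }
        })
        .collect::<Vec<_>>()
        .join(",")
}
```

Applying `to_csv_line(&split_fields(line))` to every line and writing the results to a new file gives nine comma-separated columns per row. The converted file still has no header row, so `has_header = false` would still apply when reading it.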

0 Answers