How to import csv data file into scikit-learn?

Question

From my understanding, scikit-learn accepts data in (n_samples, n_features) format, which is a 2D array. Assuming I have data in the form ...

Stock prices    indicator1    indicator2
2.0             123           1252
1.0             ..            ..
..              .             . 
.

How do I import this?

score 70 · Answer 1 · edited Aug 04 '23 at 00:41

A very good alternative to numpy loadtxt is read_csv from Pandas. The data is loaded into a Pandas DataFrame, with the big advantage that it can handle mixed data types, for example some columns containing text and others containing numbers. You can then easily select only the numeric columns and convert them to a numpy array with to_numpy (called as_matrix in older Pandas versions, since removed). Pandas will also read and write Excel files and a number of other formats.

If we have a csv file named "mydata.csv":

point_latitude,point_longitude,line,construction,point_granularity
30.102261, -81.711777, Residential, Masonry, 1
30.063936, -81.707664, Residential, Masonry, 3
30.089579, -81.700455, Residential, Wood   , 1
30.063236, -81.707703, Residential, Wood   , 3
30.060614, -81.702675, Residential, Wood   , 1

This will read in the csv, convert the numeric columns into a numpy array for scikit-learn, then reverse the order of the columns and write the result out to an Excel spreadsheet:

import numpy as np
import pandas as pd

input_file = "mydata.csv"


# comma delimited is the default
df = pd.read_csv(input_file, header = 0)

# for space delimited use:
# df = pd.read_csv(input_file, header = 0, delimiter = " ")

# for tab delimited use:
# df = pd.read_csv(input_file, header = 0, delimiter = "\t")

# put the original column names in a python list
original_headers = list(df.columns.values)

# keep only the numeric columns (public alternative to the private df._get_numeric_data())
df = df.select_dtypes(include="number")

# put the numeric column names in a python list
numeric_headers = list(df.columns.values)

# create a numpy array with the numeric values for input into scikit-learn
numpy_array = df.to_numpy()

# reverse the order of the columns
numeric_headers.reverse()
reverse_df = df[numeric_headers]

# write reverse_df to an Excel spreadsheet (.xlsx needs the openpyxl package;
# recent Pandas versions can no longer write the old .xls format)
reverse_df.to_excel('path_to_file.xlsx')

Ok but how to create a scikit learn dataset from that matrix? — Ramy Al Zuhouri, Dec 26 '17 at 12:25
Scikit-learn can take pandas dataframes as inputs, so it is almost ready. Assuming that "point_granularity" is the target variable, you could do y = df['point_granularity'] and X = df[['point_latitude', 'point_longitude', 'line', 'construction']] — denson, Dec 28 '17 at 11:17
Since some of the features are categorical you would need to one-hot-encode them for most scikit-learn models: https://stackoverflow.com/a/43038709/1810559 — denson, Dec 28 '17 at 11:20
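
A minimal sketch of the two comments above, assuming "point_granularity" is the target. It uses pandas.get_dummies for the one-hot step, which is one option among several (the linked answer may do it differently), and it inlines the sample rows with io.StringIO so the snippet runs on its own:

```python
import io

import pandas as pd

# Sample rows from mydata.csv, inlined so the snippet is self-contained.
csv_text = """point_latitude,point_longitude,line,construction,point_granularity
30.102261, -81.711777, Residential, Masonry, 1
30.063936, -81.707664, Residential, Masonry, 3
30.089579, -81.700455, Residential, Wood   , 1
30.063236, -81.707703, Residential, Wood   , 3
30.060614, -81.702675, Residential, Wood   , 1
"""

# skipinitialspace drops the blank that follows each comma
df = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)

# Strip leftover padding from the text columns ("Wood   " -> "Wood")
text_cols = ["line", "construction"]
for col in text_cols:
    df[col] = df[col].str.strip()

# Target vector
y = df["point_granularity"].to_numpy()

# One-hot encode the categorical features, leave the numeric ones untouched
features = df.drop(columns=["point_granularity"])
X_df = pd.get_dummies(features, columns=text_cols, dtype=float)
X = X_df.to_numpy()
```

X and y can then be passed to an estimator's fit method. For anything beyond a quick experiment, an encoder fitted inside a scikit-learn pipeline is usually the safer choice, because it applies the same categories to new data.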

Fred Foo · Accepted Answer · 2013-02-06T15:41:22.837

54

This is not a CSV file; this is just a space separated file. Assuming there are no missing values, you can easily load this into a Numpy array called data with

import numpy as np

# the with block closes the file once the data is loaded
with open("filename.txt") as f:
    f.readline()  # skip the header
    data = np.loadtxt(f)

If the stock price is what you want to predict (your y value, in scikit-learn terms), then you should split data using

X = data[:, 1:]  # select columns 1 through end
y = data[:, 0]   # select column 0, the stock price

Alternatively, you might be able to massage the standard Python csv module into handling this type of file.
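
Here is a self-contained sketch of the steps above that also keeps the column names instead of discarding the header. The numbers in the second data row are made up for illustration (the question elides them), and splitting the header on runs of two or more spaces is an assumption about how the file is laid out:

```python
import io
import re

import numpy as np

# Stand-in for filename.txt; the second data row is invented for the example.
text = """Stock prices    indicator1    indicator2
2.0             123           1252
1.0             456           1300
"""

f = io.StringIO(text)

# "Stock prices" contains a single space, so split on 2+ spaces instead
header = f.readline().strip()
feature_names = re.split(r"\s{2,}", header)

# The remaining lines are purely numeric and whitespace separated
data = np.loadtxt(f)

X = data[:, 1:]  # indicator columns
y = data[:, 0]   # stock price
```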

edited Feb 06 '13 at 15:41

answered Jun 14 '12 at 15:04

Fred Foo


1

Is there a way to maintain feature names using this method? – AlexFZ Oct 03 '13 at 16:12
2

@AlexFZ: not directly. Instead of just `f.readline()`, you can do `feature_names = f.readline().split()` or some variant of it (the OP's header line isn't nicely space-separated). [Pandas](http://pandas.pydata.org) has nicer functionality for this. – Fred Foo Oct 04 '13 at 09:31
6

Although the questioner provided a space separated file, the question is posed in regards to a csv data file. – tumultous_rooster Jan 26 '14 at 00:20
3

The code you specified raises `ValueError: could not convert string to float:` because my data are strings. How can I fix that? – Chedi Bechikh Jan 30 '17 at 14:31

score 20 · Answer 3 · edited Jul 23 '14 at 05:08

You can look up the loadtxt function in numpy to see its optional arguments.

A simple change for csv is

data = np.loadtxt(fname=f, delimiter=',')
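
Note that loadtxt raises a ValueError if the file contains a header row or text columns. A hedged sketch of one way around that, using the skiprows and usecols arguments, is shown here on the sample rows from the first answer (inlined without the padding spaces so the snippet runs on its own):

```python
import io

import numpy as np

# Sample rows from the first answer, inlined so the snippet is self-contained.
csv_text = """point_latitude,point_longitude,line,construction,point_granularity
30.102261,-81.711777,Residential,Masonry,1
30.063936,-81.707664,Residential,Masonry,3
30.089579,-81.700455,Residential,Wood,1
30.063236,-81.707703,Residential,Wood,3
30.060614,-81.702675,Residential,Wood,1
"""

data = np.loadtxt(
    io.StringIO(csv_text),
    delimiter=",",
    skiprows=1,         # skip the header row
    usecols=(0, 1, 4),  # read only the numeric columns
)

X = data[:, :2]  # latitude and longitude
y = data[:, 2]   # point_granularity
```

This drops the text columns entirely. If they are needed as features, loading with Pandas and encoding them is the more practical route.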

edited Jul 23 '14 at 05:08

Baby Groot


answered Jul 23 '14 at 05:02

William komp


score 2 · Answer 4 · edited Aug 13 '18 at 10:42

Use `numpy` to load the CSV file:

import numpy as np
dataset = np.loadtxt('./example.csv', delimiter=',')

edited Aug 13 '18 at 10:42

Jan Trienes


answered Nov 10 '17 at 10:58

sixsixsix

