Given 3 CSV files with the same number of rows, like these:

fx.csv:

7.23,4.41,0.17453,0.12
6.63,3.21,0.3453,0.32
2.27,2.21,0.3953,0.83

f0.csv:

1.23,3.21,0.123,0.12
8.23,9.21,0.183,0.32
7.23,6.21,0.123,0.12

and f1.csv:

6.23,3.21,0.153,0.123
2.23,2.26,0.182,0.22
9.23,9.21,0.183,0.135
The f0.csv and f1.csv files come with corresponding labels: 0s and 1s respectively.

The goal is to read the concatenated values into a dask.DataFrame such that we get:

(1) fx.csv concatenated horizontally with f0.csv and the 0s
(2) fx.csv concatenated horizontally with f1.csv and the 1s
(3) (1) and (2) concatenated vertically
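
For the toy files above, the desired end result is these 6 rows of 9 columns:

7.23,4.41,0.17453,0.12,1.23,3.21,0.123,0.12,0
6.63,3.21,0.3453,0.32,8.23,9.21,0.183,0.32,0
2.27,2.21,0.3953,0.83,7.23,6.21,0.123,0.12,0
7.23,4.41,0.17453,0.12,6.23,3.21,0.153,0.123,1
6.63,3.21,0.3453,0.32,2.23,2.26,0.182,0.22,1
2.27,2.21,0.3953,0.83,9.23,9.21,0.183,0.135,1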
I have tried reading them with dask and writing the combined data out to disk like this:
import numpy as np
import pandas as pd
import dask.dataframe as dd
import dask.array as da

fx = dd.read_csv('fx.csv', header=None)
f0 = dd.read_csv('f0.csv', header=None)
f1 = dd.read_csv('f1.csv', header=None)

l0 = dd.from_array(np.array([0] * len(fx)))  # 0 labels for the fx|f0 block
l1 = dd.from_array(np.array([1] * len(fx)))  # 1 labels for the fx|f1 block

# compute() materializes every partition in memory as pandas objects
da.to_npy_stack(
    'data/',
    da.concatenate([
        da.from_array(pd.concat([fx.compute(), f0.compute(), l0.compute()], axis=1).values, chunks=1000),
        da.from_array(pd.concat([fx.compute(), f1.compute(), l1.compute()], axis=1).values, chunks=1000),
    ], axis=0, allow_unknown_chunksizes=True),
    axis=0)
I can also do this in unix before reading the result into dask, like this:
# Create the label files.
$ wc -l fx.csv
3 fx.csv
$ seq 3 | sed "c 0" > l0.csv
$ seq 3 | sed "c 1" > l1.csv
# Concatenate horizontally.
$ paste -d"," fx.csv f0.csv l0.csv > x0.csv
$ paste -d"," fx.csv f1.csv l1.csv > x1.csv
# Concatenate vertically.
$ cat x0.csv x1.csv > data.csv
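
If I go this route, I assume reading the combined file back and writing the HDF5 store is then just the following (the 'data.h5' filename and '/data' key are my own choices):

import dask.dataframe as dd

df = dd.read_csv('data.csv', header=None)  # lazy, partitioned read of the combined file
df.to_hdf('data.h5', '/data')              # write the partitions out to an HDF5 store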
The actual dataset has 256 columns in each of the f*.csv files and 22,000,000 rows, so running the dask Python code above, which pulls everything into memory with .compute(), isn't feasible.
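
For what it's worth, a fully lazy variant that I think avoids the .compute() calls would look like the sketch below; I am assuming to_dask_array(lengths=True), rechunk and da.to_hdf5 behave the way I expect, and I have not tried it at full scale:

import dask.array as da
import dask.dataframe as dd

fx = dd.read_csv('fx.csv', header=None).to_dask_array(lengths=True)
f0 = dd.read_csv('f0.csv', header=None).to_dask_array(lengths=True)
f1 = dd.read_csv('f1.csv', header=None).to_dask_array(lengths=True)

# align the row chunks of f0/f1 with fx so the column-wise concatenates line up
f0 = f0.rechunk((fx.chunks[0], f0.chunks[1]))
f1 = f1.rechunk((fx.chunks[0], f1.chunks[1]))

# lazy label columns, chunked like fx's rows
l0 = da.zeros((fx.shape[0], 1), chunks=(fx.chunks[0], 1))
l1 = da.ones((fx.shape[0], 1), chunks=(fx.chunks[0], 1))

data = da.concatenate([
    da.concatenate([fx, f0, l0], axis=1),  # (1) fx | f0 | 0s
    da.concatenate([fx, f1, l1], axis=1),  # (2) fx | f1 | 1s
], axis=0)                                 # (3) stack (1) on top of (2)

da.to_hdf5('data.h5', '/data', data)  # streams to HDF5 block by block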
My questions (in parts) are:

(1) Is the dask method in the Python code the easiest/most memory-efficient way to read the data and write it out to an HDF5 store?
(2) Is there any other method that is more efficient than the unix approach described above?