My question is similar to the one here, but I'm having problems implementing the answer, and I cannot comment in that thread.
So, I have a big CSV file containing nested data: 2 columns separated by whitespace (say the first column is Y and the second is X). Column X is itself a comma-separated value.
21.66 2.643227,1.2698358,2.6338573,1.8812188,3.8708665,...
35.15 3.422151,-0.59515584,2.4994135,-0.19701914,4.0771823,...
15.22 2.8302398,1.9080592,-0.68780196,3.1878228,4.6600842,...
...
I want to read this CSV into 2 different Spark tables using sparklyr.
So far this is what I've been doing:
Use spark_read_csv to import all CSV contents into a Spark data table:

df = spark_read_csv(sc, path = "path", name = "simData", delimiter = " ", header = "false", infer_schema = "false")

The result is a Spark table named simData with 2 columns: C0 and C1.
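As a quick sanity check (my own verification step, not part of the linked answer), I peek at the imported table:

df %>% head()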
Use dplyr to select the first and second columns, then register them as new tables named simY and simX respectively:

simY <- df %>% select(C0) %>% sdf_register("simY")
simX <- df %>% select(C1) %>% sdf_register("simX")
Split the values in simX using the ft_regex_tokenizer function, following the answer written here:

ft_regex_tokenizer(input_DF, input.col = "COL", output.col = "ResultCols", pattern = '\\###')
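Applied to my data, I believe the concrete call looks something like this (splitting on the comma is my assumption from the data format; the output column name Result matches the head output shown below):

simX %>%
  ft_regex_tokenizer(input.col = "C1", output.col = "Result", pattern = ",") %>%
  select(Result) %>%
  head()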
But when I try to head the result using dplyr:
Source: query [6 x 1]
Database: spark connection master=yarn-client app=sparklyr local=FALSE
Result
<list>
1 <list [789]>
2 <list [789]>
3 <list [789]>
4 <list [789]>
5 <list [789]>
6 <list [789]>
I want to turn this into a new Spark table and convert the type to double. Is there any way to do this?
I've considered collecting the data into R (using dplyr), converting it to a matrix, and then running strsplit on each row, but I don't think that's a viable solution because the CSV size can go up to 40 GB.
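For reference, the collect-based approach I want to avoid would look roughly like this (a sketch only; it pulls everything into driver memory, which cannot work at 40 GB):

# pull simX into local R memory -- infeasible for the full data
localX <- simX %>% collect()
# split each comma-separated string and bind the rows into a numeric matrix
xMatrix <- do.call(rbind, lapply(strsplit(localX$C1, ","), as.numeric))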
EDIT: Spark version is 1.6.0