1

I have a log file like the line below, and I want to load it into a Spark DataFrame in Scala.

2015-05-13T23:39:43.945958Z my-loadbalancer 192.168.131.39:2817 10.0.0.1:80 0.000086 0.001048 0.001337 200 200 0 57 "GET https://www.example.com:443/ HTTP/1.1" "curl/7.38.0" DHE-RSA-AES128-SHA TLSv1.2

I want to replace all the spaces with commas so that I can use spark.sql, but I am unable to do so.

Here is everything I tried:

  1. Tried importing it as a text file first to see if there is a replaceAll method.
  2. Tried splitting on spaces.

Any suggestions? I went through the documentation and there is no mention of a replace function like in Pandas.
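For reference, one plain-Scala sketch of the literal replacement (spaces to commas, but only outside double quotes) uses a lookahead that requires an even number of quotes between the match and the end of the line; this is not Spark-specific:

```scala
// Sketch: replace runs of whitespace with commas, but only outside
// double-quoted fields. The lookahead matches only if an even number
// of quote characters remain to the right of the whitespace run.
val line = """2015-05-13T23:39:43.945958Z my-loadbalancer 192.168.131.39:2817 10.0.0.1:80 0.000086 0.001048 0.001337 200 200 0 57 "GET https://www.example.com:443/ HTTP/1.1" "curl/7.38.0" DHE-RSA-AES128-SHA TLSv1.2"""

val csvLine = line.replaceAll("""\s+(?=(?:[^"]*"[^"]*")*[^"]*$)""", ",")
// The quoted request and user-agent fields keep their internal spaces;
// every other run of whitespace becomes a comma.
```

This only works when quotes are balanced on every line, which is the case for this load-balancer log format.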

San
  • 17
  • 5
  • Possible duplicate of [how to use Regexp\_replace in spark](https://stackoverflow.com/questions/40080609/how-to-use-regexp-replace-in-spark) – 10465355 Nov 26 '18 at 20:54

3 Answers

1

You can simply tell Spark that your delimiter is a single space, like this:

val df = spark.read
  .option("delimiter", " ")  // single space as the field separator
  .csv("path/to/file")       // quoted fields such as "GET ... HTTP/1.1" stay intact (default quote is ")
Oli
  • 9,766
  • 5
  • 25
  • 46
0

Since you don't have typed columns yet, I'd start with an RDD, split the text with a map, then convert to a DataFrame with a schema. Roughly:

val rdd = sc.textFile({logline path}).map(line => line.split("\\s+"))

Then you need to turn your RDD (where each record is an array of tokens) into a DataFrame. The most robust way is to map your arrays to Row objects, since an RDD[Row] is what underlies a DataFrame.

A simpler way to get up and running, though, would be:

spark.createDataFrame(rdd).toDF("datetime", "host", "ip", ...)
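Note that a plain whitespace split will also break the quoted request and user-agent fields apart. A quote-aware tokenizer in plain Scala (no Spark needed) might look like this sketch, which you could use inside the map instead of split:

```scala
// Sketch: tokenize a log line on spaces while keeping double-quoted
// fields together. The regex matches either a whole double-quoted field
// or a run of non-space characters; quotes are then stripped.
val token = """"[^"]*"|\S+""".r

def tokenize(line: String): Array[String] =
  token.findAllIn(line).map(_.stripPrefix("\"").stripSuffix("\"")).toArray

val line = """2015-05-13T23:39:43.945958Z my-loadbalancer 192.168.131.39:2817 10.0.0.1:80 0.000086 0.001048 0.001337 200 200 0 57 "GET https://www.example.com:443/ HTTP/1.1" "curl/7.38.0" DHE-RSA-AES128-SHA TLSv1.2"""
val fields = tokenize(line)
// fields(11) is the whole request string, fields(12) the user agent
```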
benlaird
  • 839
  • 7
  • 9
  • Almost precise. Thank you. – San Nov 26 '18 at 21:30
  • It is also replacing the space inside the quotes. Looking for a way to overcome it. – San Nov 26 '18 at 22:09
  • Now that I think of it, Spark dataframes have a CSV reader, it probably makes sense to just use that – benlaird Nov 26 '18 at 22:23
  • Scala CSV reader: https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.sql.DataFrameReader@csv(paths:String*):org.apache.spark.sql.DataFrame – benlaird Nov 26 '18 at 22:26
  • I want to use it but my dataset is a group of arrays, I mean each row is an array as shown in the above log. So I am looking to split everything based on space, give names to columns and then do SQL on it. – San Nov 26 '18 at 22:45
  • Thanks a lot @benlaird I think I figured it out. – San Nov 26 '18 at 22:51
0

If you just want to split on spaces while retaining the strings within double quotes, you can use the Apache Commons CSV library.

import org.apache.commons.csv.{CSVFormat, CSVParser}

val str = """2015-05-13T23:39:43.945958Z my-loadbalancer 192.168.131.39:2817 10.0.0.1:80 0.000086 0.001048 0.001337 200 200 0 57 "GET https://www.example.com:443/ HTTP/1.1" "curl/7.38.0" DHE-RSA-AES128-SHA TLSv1.2"""
val record = CSVParser.parse(str, CSVFormat.newFormat(' ').withQuote('"')).getRecords.get(0)
val http = record.get(11)
val curl = record.get(12)
println(http)
println(curl)

Results:

GET https://www.example.com:443/ HTTP/1.1
curl/7.38.0
stack0114106
  • 8,534
  • 3
  • 13
  • 38