1

I have a log file like the line below, and I want to load it into a Spark DataFrame in Scala.

2015-05-13T23:39:43.945958Z my-loadbalancer 192.168.131.39:2817 10.0.0.1:80 0.000086 0.001048 0.001337 200 200 0 57 "GET https://www.example.com:443/ HTTP/1.1" "curl/7.38.0" DHE-RSA-AES128-SHA TLSv1.2

I want to replace all the spaces with commas so that I can use spark.sql, but I am unable to do so.

Here is everything I tried:

  1. Tried importing it as a text file first to see if there is a replaceAll method.
  2. Tried splitting on spaces.

Any suggestions? I went through the documentation and there is no mention of a replace function like in Pandas.
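For reference, one plain-Scala sketch of the literal replacement (spaces to commas, but only outside double quotes) uses a lookahead that requires an even number of quotes between the match and the end of the line; this is not Spark-specific:

```scala
// Sketch: replace runs of whitespace with commas, but only outside
// double-quoted fields. The lookahead matches only if an even number
// of quote characters remain to the right of the whitespace run.
val line = """2015-05-13T23:39:43.945958Z my-loadbalancer 192.168.131.39:2817 10.0.0.1:80 0.000086 0.001048 0.001337 200 200 0 57 "GET https://www.example.com:443/ HTTP/1.1" "curl/7.38.0" DHE-RSA-AES128-SHA TLSv1.2"""

val csvLine = line.replaceAll("""\s+(?=(?:[^"]*"[^"]*")*[^"]*$)""", ",")
// The quoted request and user-agent fields keep their internal spaces;
// every other run of whitespace becomes a comma.
```

This only works when quotes are balanced on every line, which is the case for this load-balancer log format.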

San
  • 17
  • 5
  • Possible duplicate of [how to use Regexp\_replace in spark](https://stackoverflow.com/questions/40080609/how-to-use-regexp-replace-in-spark) – 10465355 Nov 26 '18 at 20:54

3 Answers

1

You can simply tell Spark that your delimiter is a single space, like this:

val df = spark.read
  .option("delimiter", " ")  // single space as the field separator
  .csv("path/to/file")       // quoted fields such as "GET ... HTTP/1.1" stay intact (default quote is ")
Oli
  • 9,766
  • 5
  • 25
  • 46
0

Since you don't have typed columns yet, I'd start with an RDD, split the text with a map, then convert to a DataFrame with a schema. Roughly:

val rdd = sc.textFile({logline path}).map(line => line.split("\\s+"))

Then you need to turn your RDD (where each record is an array of tokens) into a DataFrame. The most robust way is to map your arrays to Row objects, since an RDD[Row] is what underlies a DataFrame.

A simpler way to get up and running, though, would be:

spark.createDataFrame(rdd).toDF("datetime", "host", "ip", ...)
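Note that a plain whitespace split will also break the quoted request and user-agent fields apart. A quote-aware tokenizer in plain Scala (no Spark needed) might look like this sketch, which you could use inside the map instead of split:

```scala
// Sketch: tokenize a log line on spaces while keeping double-quoted
// fields together. The regex matches either a whole double-quoted field
// or a run of non-space characters; quotes are then stripped.
val token = """"[^"]*"|\S+""".r

def tokenize(line: String): Array[String] =
  token.findAllIn(line).map(_.stripPrefix("\"").stripSuffix("\"")).toArray

val line = """2015-05-13T23:39:43.945958Z my-loadbalancer 192.168.131.39:2817 10.0.0.1:80 0.000086 0.001048 0.001337 200 200 0 57 "GET https://www.example.com:443/ HTTP/1.1" "curl/7.38.0" DHE-RSA-AES128-SHA TLSv1.2"""
val fields = tokenize(line)
// fields(11) is the whole request string, fields(12) the user agent
```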
benlaird
  • 839
  • 7
  • 9
  • Almost precise. Thank you. – San Nov 26 '18 at 21:30
  • It is also replacing the space inside the quotes. Looking for a way to overcome it. – San Nov 26 '18 at 22:09
  • Now that I think of it, Spark dataframes have a CSV reader, it probably makes sense to just use that – benlaird Nov 26 '18 at 22:23
  • Scala CSV reader: https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.sql.DataFrameReader@csv(paths:String*):org.apache.spark.sql.DataFrame – benlaird Nov 26 '18 at 22:26
  • I want to use it but my dataset is a group of arrays, I mean each row is an array as shown in the above log. So I am looking to split everything based on space, give names to columns and then do SQL on it. – San Nov 26 '18 at 22:45
  • Thanks a lot @benlaird I think I figured it out. – San Nov 26 '18 at 22:51
0

If you just want to split on spaces while retaining the strings within double quotes, you can use the Apache Commons CSV library.

import org.apache.commons.csv.{CSVFormat, CSVParser}

val str = """2015-05-13T23:39:43.945958Z my-loadbalancer 192.168.131.39:2817 10.0.0.1:80 0.000086 0.001048 0.001337 200 200 0 57 "GET https://www.example.com:443/ HTTP/1.1" "curl/7.38.0" DHE-RSA-AES128-SHA TLSv1.2"""
val record = CSVParser.parse(str, CSVFormat.newFormat(' ').withQuote('"')).getRecords.get(0)
val http = record.get(11)
val curl = record.get(12)
println(http)
println(curl)

Results:

GET https://www.example.com:443/ HTTP/1.1
curl/7.38.0
stack0114106
  • 8,534
  • 3
  • 13
  • 38