What the difference between using single quote and double quote in split() method in scala?

Question

I am working on the cca-175 practice questions. I am given a text file which is split by |:

Christopher|Jan 11, 2015, |5 
Kapil|11 Jan, 2015|5
Thomas|6/17/2014|5
John|22-08-2013|5
Mithun|2013|5
Jitendra||5

Then I saved the file as an RDD and tried to map it. However, when used single quote and double quote in the split method, Scala returns two different outcomes and using the single quote is right.

Using single quoteline.split('|') , it returned: Array[String] = Array(Christopher, Jan 11, 2015, 5), which is right.

Using double quote line.split("|"), it returned : Array[String] = Array(C, h, r, i, s, t, o, p, h, e, r, |, J, a, n, " ", 1, 1, , " ", 2, 0, 1, 5, |, 5), which is not what I need.

Can anyone help me with the question? Thanks!

scala> val feedbackmap = feedback.map(line=>line.split('|'))
feedbackmap: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[4] at map at <console>:29

scala> feedbackmap.first
19/04/10 14:15:55 INFO SparkContext: Starting job: first at <console>:32
19/04/10 14:15:55 INFO DAGScheduler: Got job 4 (first at <console>:32) with 1 output partitions
19/04/10 14:15:55 INFO DAGScheduler: Final stage: ResultStage 4 (first at <console>:32)
19/04/10 14:15:55 INFO DAGScheduler: Parents of final stage: List()
19/04/10 14:15:55 INFO DAGScheduler: Missing parents: List()
19/04/10 14:15:55 INFO DAGScheduler: Submitting ResultStage 4 (MapPartitionsRDD[4] at map at <console>:29), which has no missing parents
19/04/10 14:15:55 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 3.4 KB, free 510.7 MB)
19/04/10 14:15:55 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 2003.0 B, free 510.7 MB)
19/04/10 14:15:55 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on localhost:43371 (size: 2003.0 B, free: 511.1 MB)
19/04/10 14:15:55 INFO SparkContext: Created broadcast 5 from broadcast at DAGScheduler.scala:1008
19/04/10 14:15:55 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 4 (MapPartitionsRDD[4] at map at <console>:29)
19/04/10 14:15:55 INFO TaskSchedulerImpl: Adding task set 4.0 with 1 tasks
19/04/10 14:15:55 INFO TaskSetManager: Starting task 0.0 in stage 4.0 (TID 5, localhost, partition 0,ANY, 2171 bytes)
19/04/10 14:15:55 INFO Executor: Running task 0.0 in stage 4.0 (TID 5)
19/04/10 14:15:55 INFO HadoopRDD: Input split: hdfs://nn01.itversity.com:8020/user/junyanxu/scenario_37/feedback.txt:0+58
19/04/10 14:15:55 INFO Executor: Finished task 0.0 in stage 4.0 (TID 5). 2173 bytes result sent to driver
19/04/10 14:15:55 INFO TaskSetManager: Finished task 0.0 in stage 4.0 (TID 5) in 7 ms on localhost (1/1)
19/04/10 14:15:55 INFO TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks have all completed, from pool 
19/04/10 14:15:55 INFO DAGScheduler: ResultStage 4 (first at <console>:32) finished in 0.007 s
19/04/10 14:15:55 INFO DAGScheduler: Job 4 finished: first at <console>:32, took 0.012483 s
19/04/10 14:15:55 INFO TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks have all completed, from pool 
res3: Array[String] = Array(Christopher, Jan 11, 2015, 5)
scala> 19/04/10 14:20:55 WARN SparkContext: Killing executors is only supported in coarse-grained mode
19/04/10 14:20:55 WARN ExecutorAllocationManager: Unable to reach the cluster manager to kill executor driver!
val
scala> val feedbackmap2 = feedback.map(line=>line.split("|"))
feedbackmap2: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[5] at map at <console>:29
scala> feedbackmap2.first
19/04/10 14:22:58 INFO SparkContext: Starting job: first at <console>:32
19/04/10 14:22:58 INFO DAGScheduler: Got job 5 (first at <console>:32) with 1 output partitions
19/04/10 14:22:58 INFO DAGScheduler: Final stage: ResultStage 5 (first at <console>:32)
19/04/10 14:22:58 INFO DAGScheduler: Parents of final stage: List()
19/04/10 14:22:58 INFO DAGScheduler: Missing parents: List()
19/04/10 14:22:58 INFO DAGScheduler: Submitting ResultStage 5 (MapPartitionsRDD[5] at map at <console>:29), which has no missing parents
19/04/10 14:22:58 INFO MemoryStore: Block broadcast_6 stored as values in memory (estimated size 3.4 KB, free 510.7 MB)
19/04/10 14:22:58 INFO MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 2003.0 B, free 510.7 MB)
19/04/10 14:22:58 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on localhost:43371 (size: 2003.0 B, free: 511.1 MB)
19/04/10 14:22:58 INFO SparkContext: Created broadcast 6 from broadcast at DAGScheduler.scala:1008
19/04/10 14:22:58 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 5 (MapPartitionsRDD[5] at map at <console>:29)
19/04/10 14:22:58 INFO TaskSchedulerImpl: Adding task set 5.0 with 1 tasks
19/04/10 14:22:58 INFO TaskSetManager: Starting task 0.0 in stage 5.0 (TID 6, localhost, partition 0,ANY, 2171 bytes)
19/04/10 14:22:58 INFO Executor: Running task 0.0 in stage 5.0 (TID 6)
19/04/10 14:22:58 INFO HadoopRDD: Input split: hdfs://nn01.itversity.com:8020/user/junyanxu/scenario_37/feedback.txt:0+58
19/04/10 14:22:58 INFO Executor: Finished task 0.0 in stage 5.0 (TID 6). 2244 bytes result sent to driver
19/04/10 14:22:58 INFO TaskSetManager: Finished task 0.0 in stage 5.0 (TID 6) in 12 ms on localhost (1/1)
19/04/10 14:22:58 INFO TaskSchedulerImpl: Removed TaskSet 5.0, whose tasks have all completed, from pool 
19/04/10 14:22:58 INFO DAGScheduler: ResultStage 5 (first at <console>:32) finished in 0.012 s
19/04/10 14:22:58 INFO DAGScheduler: Job 5 finished: first at <console>:32, took 0.040166 s
res4: Array[String] = Array(C, h, r, i, s, t, o, p, h, e, r, |, J, a, n, " ", 1, 1, ,, " ", 2, 0, 1, 5, |, 5)

See https://stackoverflow.com/questions/21524642/splitting-string-with-pipe-character — Tzach Zohar, Apr 10 '19 at 19:47

score 2 · Answer 1 · answered Apr 10 '19 at 19:18

2

in scala single quote denotes a char so split('|') uses the | char. When you use double quotes you use a string and specifically split can accept a regex string so the unescaped | inside a string is interpreted as the regex or

answered Apr 10 '19 at 19:18

Arnon Rotem-Gal-Oz

25,469
3
45
68

score 1 · Answer 2 · answered Apr 10 '19 at 19:53

I think Arnon Rotem-Gal-Oz made a good point about the meaning of the | inside a string as argument to split: it's a logical operator.

Moreover, what's happening here is that you use regex which means empty string or empty string. As empty string can be found basically anywhere in a String (if it helps you you can understand that "ab" is equivalent to "a" + "" + "b"), split is made between each character.

See also scala string.split does not work which states:

If you use split('|') or split("""\|""") you should get what you want.

Indeed, an escaped | isn't anymore considered as a logical operator but as the character itself in a regex expression.

What the difference between using single quote and double quote in split() method in scala?

2 Answers2