I am working on a Spark project. In the following code, I use a string to collect my results so that I can write them to a file later (I know this is not the right way to do it; I am just checking what is inside a Tuple3 returned by a method). The string gets truncated after a foreach loop. Here is the relevant part of my code:
val newLine = sys.props("line.separator") // also tried "\n". I am using OS X.
var str = s"*** ${newLine}"
for (tuple3 <- ArrayOfTuple3s) {
  for (list <- tuple3._3) {
    for (strItem <- list) {
      str += s"${strItem}, "
    }
    str += s"${newLine}"
  }
  str += s"${newLine}"
  println(str) // prints the concatenated result as expected
}
print("str=" + str)
The first println call prints the correct value of the string (the concatenated result), but when the loop ends, the value of str is *** (the same value assigned to it before the first loop).
Edit: I replaced the immutable String str with a StringBuilder, but the result did not change:
val newLine: String = sys.props("line.separator")
var str1: StringBuilder = new StringBuilder(15000)
for (tuple3 <- ArrayOfTuple3s) {
  for (list <- tuple3._3) {
    for (str <- list) {
      str1.append(s"${str}, ")
    }
    str1.append(s"${newLine}")
  }
  str1.append(s"${newLine}")
  println(str1.toString())
}
print("resulting str1=" + str1.toString())
Edit 2: I mapped the RDD to extract the Tuple3's third field directly, which gives an RDD of Arrays of Lists. I changed the code accordingly, but I still get the same result (the resulting string is empty after the loop, although inside the for loop it is not).
val rddOfArraysOfLists = getArrayOfTuple3s(mainRdd).map(_._3)
for (arrayOfLists <- rddOfArraysOfLists) {
  for (list <- arrayOfLists) {
    for (field <- list) {
      str1.append(s"${field}, ")
    }
    str1.append(" -- ")
  }
  str1.append(s"${newLine}")
  println(str1.toString())
}
Edit 4: I think the problem is not with strings at all. There seems to be a problem with all types of variables.
var count = 0
for (arrayOfLists <- myArray) {
  count = arrayOfLists.last(3).toInt
  println(s"count=$count")
}
println(s"count=$count")
The value is non-zero inside the loop, but it is 0 after the loop. Any idea why?
Edit 5: I cannot publish the whole code (due to confidentiality restrictions), but here is the main part of it. If it matters, I am running Spark on my local machine in IntelliJ IDEA (for debugging).
System.setProperty("spark.cores.max", "8")
System.setProperty("spark.executor.memory", "15g")
val sc = new SparkContext("local", getClass.getName)
val samReg = sc.objectFile[Sample](sampleLocation, 200).distinct
val samples = samReg.filter(f => f.uuid == "dce03545e8034242").sortBy(_.time).cache()
val top3Samples = samples.take(3)
for (sample <- top3Samples) {
  print("sample: ")
  println(s"uuid=${sample.uuid}, time=${sample.time}, model=${sample.model}")
}
val firstTimeStamp = samples.first.time
val targetTime = firstTimeStamp + 2592000 // + 1 month in seconds (samples during the first month)
val rddOfArrayOfSamples = getCountsRdd(samples.filter(_.time <= targetTime)).map(_._1).cache()
// Due to confidentiality matters, I cannot reveal the code,
// but here is a description:
// I have an array of samples. Each sample has a few String fields
// and is represented by a List[String]
// The above RDD is of the type RDD[Array[List[String]]].
// It contains only a single array of samples
// (because I passed a filtered set of samples to the function),
// but it may contain more.
// The fourth field of each sample (list) is an increasing number (count)
println(s"number of arrays in the RDD: ${rddOfArrayOfSamples.count()}")
var maxCount = 0
for (arrayOfLists <- rddOfArrayOfSamples) {
  println(s"Last item of the array (a list)=${arrayOfLists.last}")
  maxCount = arrayOfLists.last(3).toInt
  println(s"maxCount=${maxCount}")
}
println(s"maxCount=${maxCount}")
The output:
sample: uuid=dce03545e8034242, time=1360037324, model=Nexus 4
sample: uuid=dce03545e8034242, time=1360037424, model=Nexus 4
sample: uuid=dce03545e8034242, time=1360037544, model=Nexus 4
number of arrays in the RDD: 1
Last item of the array (a list)=List(dce03545e8034242, Nexus 4, 1362628767, 32, 2089, 0.97, 0.15999999999999992, 0)
maxCount=32
maxCount=0
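
Note for anyone trying to reproduce this: below is a minimal sketch of the same pattern, assuming a local SparkContext and a tiny made-up RDD[Array[List[String]]]. The object name `ClosureSketch`, the helper names, and the sample values are hypothetical and are not my real code. A `for` loop without `yield` over an RDD compiles to `rdd.foreach(...)`, and Spark runs that body as a task on a serialized copy of the closure, so assignments to a captured driver-side `var` are not guaranteed to be visible once the loop ends. The sketch instead obtains the value through an action that returns its result to the driver, or by collecting the data first and looping over a plain Array. Note that it takes the maximum across arrays rather than the value from the last array visited, which is the same thing when the RDD holds a single array.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ClosureSketch {
  // Largest count (fourth field of the last list in each array),
  // computed by an RDD action that returns its result to the driver.
  def maxCountOf(sc: SparkContext, data: Seq[Array[List[String]]]): Int = {
    val rdd = sc.parallelize(data)
    rdd.map(_.last(3).toInt).max()
  }

  // Same result, obtained by collecting to the driver and looping locally.
  def maxCountLocal(sc: SparkContext, data: Seq[Array[List[String]]]): Int = {
    val rdd = sc.parallelize(data)
    var maxCount = 0
    // rdd.collect() is a plain Array, so this loop runs on the driver
    // and the assignment below updates the driver's own variable.
    for (arrayOfLists <- rdd.collect()) {
      maxCount = math.max(maxCount, arrayOfLists.last(3).toInt)
    }
    maxCount
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("ClosureSketch")
    val sc = new SparkContext(conf)
    try {
      // Made-up data: one array of samples, each a List[String] with 4 fields.
      val data = Seq(Array(
        List("dce03545e8034242", "Nexus 4", "1360037324", "1"),
        List("dce03545e8034242", "Nexus 4", "1362628767", "32")
      ))
      println(s"maxCount=${maxCountOf(sc, data)}")
      println(s"maxCount=${maxCountLocal(sc, data)}")
    } finally {
      sc.stop()
    }
  }
}
```

Collecting is only reasonable when the data fits in driver memory; for larger data, the action-based version (or an accumulator) is the safer choice.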