I'm trying to read a regular file from HDFS in a class that I will run through spark-submit.

I have a method that performs some String operations, and I create an RDD from the resulting string.

I perform the string operations below before creating the RDD.

Should I use a StringBuilder or a StringBuffer for the variable valueString?

    while ((line = bf.readLine()) != null) {
        String trimmedLine = line.trim();
        if (trimmedLine.charAt(trimmedLine.length() - 1) == ';') {
            if (extractionInProgress) {
                valueString = valueString.concat(trimmedLine.substring(0, trimmedLine.indexOf(";")));
                keyValues.put(searchKey, valueString);
                extractionInProgress = false;
                valueString = "";
            } else {
                int indexOfTab = trimmedLine.indexOf(" ");
                if (indexOfTab > -1) {
                    String keyInLine = trimmedLine.substring(0, indexOfTab);
                    valueString = trimmedLine.substring(indexOfTab + 1, trimmedLine.indexOf(";"));
                    keyValues.put(keyInLine, valueString);
                    valueString = "";
                }
            }
        } else {
            if (!extractionInProgress) {
                searchKey = trimmedLine;
                extractionInProgress = true;
            } else {
                valueString = valueString.concat(trimmedLine.concat("\n"));
            }
        }
    }
Yaron
John Thomas

1 Answer

The only difference between the two is that StringBuffer has synchronized methods (which is something you almost never need). So keep valueString as a local variable and go with StringBuilder.
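As a rough sketch of what that could look like, here is the loop from the question rewritten around a local StringBuilder. The class name `KeyValueParser`, the `parse` method, the `LinkedHashMap`, and the sample input are my own assumptions for illustration. Skipping blank lines is also an assumption (the original `charAt` call would throw on an empty line). The `indexOf` logic otherwise mirrors the question's code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical wrapper class; the question does not show the enclosing class.
class KeyValueParser {

    // Reads "key value;" lines, or a key line followed by value lines
    // where the last value line ends with ';'.
    static Map<String, String> parse(BufferedReader reader) throws IOException {
        Map<String, String> keyValues = new LinkedHashMap<>();
        // Local variable: each call (and so each thread) gets its own builder.
        StringBuilder value = new StringBuilder();
        String searchKey = null;
        boolean extractionInProgress = false;
        String line;
        while ((line = reader.readLine()) != null) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // assumption: blank lines carry no data
            }
            if (trimmed.endsWith(";")) {
                if (extractionInProgress) {
                    // Append everything before the first ';' and store the value.
                    value.append(trimmed, 0, trimmed.indexOf(';'));
                    keyValues.put(searchKey, value.toString());
                    extractionInProgress = false;
                    value.setLength(0); // reuse the builder for the next key
                } else {
                    int sep = trimmed.indexOf(' ');
                    if (sep > -1) {
                        keyValues.put(trimmed.substring(0, sep),
                                trimmed.substring(sep + 1, trimmed.indexOf(';')));
                    }
                }
            } else if (!extractionInProgress) {
                searchKey = trimmed;
                extractionInProgress = true;
            } else {
                value.append(trimmed).append('\n');
            }
        }
        return keyValues;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample input, not taken from the question.
        String input = "name spark;\nquery\nselect *\nfrom t;\n";
        Map<String, String> result = parse(new BufferedReader(new StringReader(input)));
        System.out.println(result);
    }
}
```

Only the String produced by `toString()` leaves the method, so the builder itself is never visible to another thread.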

    valueString=valueString.concat(trimmedLine.concat("\n"));

This kind of code makes me wonder whether you want to concatenate a multi-line String at all. Maybe you can produce an RDD with a List of lines instead and move some of the current pre-processing into the Spark job itself?

Thilo
  • I understand the thread-safety part. My doubt was whether multithreading has anything to do with local variables when I execute through spark-submit. – John Thomas May 14 '17 at 07:45
  • Thread-safety is only an issue if you share the object between multiple threads. If your StringBuilder only lives within a single method and you never pass it anywhere (which is how they are usually used, only the resulting String gets output), then you are fine. – Thilo May 14 '17 at 07:48
  • "If your StringBuilder only lives within a single method". Local variables live on the stack. If ten threads run the same method, they each get their own independent "copy". (Of course, if you pass the reference to somewhere outside the method, such as setting it to an object field, then other threads may get a hold of the instance). – Thilo May 14 '17 at 07:50
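To make the point from the comments concrete, here is a minimal sketch in plain Java (no Spark) in which ten threads run the same method and each one builds its own String with a local StringBuilder. The class `LocalBuilderDemo`, its method names, and the thread count are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical demo class, not part of the question's code.
class LocalBuilderDemo {

    // The StringBuilder is created inside the method and never escapes it;
    // only the immutable String result is returned.
    static String buildLine(int id) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append(id);
        }
        return sb.toString();
    }

    // Runs buildLine on several threads at once and collects the results in order.
    static List<String> runConcurrently(int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                final int id = t;
                Callable<String> task = () -> buildLine(id);
                futures.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                results.add(f.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> results = runConcurrently(10);
        for (int i = 0; i < results.size(); i++) {
            // Each result should contain only the digit of its own task.
            System.out.println(i + " -> length " + results.get(i).length());
        }
    }
}
```

Because every invocation works on its own builder instance, no synchronization is needed. The picture changes only if the builder reference is stored in a field or otherwise handed to another thread.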