1

I have a file that contains:

user_name     order_id     M_Status
jOHN          1000         married to Emma

Each "column" is separated from the next by 5 spaces, but the number of spaces can differ in another file. Because the words under the M_Status column are separated by a single space, splitting by (" +") didn't work: the M_Status value needs to stay one string. What I'm trying to do is count the spaces between the words in the first line, then split all the remaining lines by that number of spaces (5 here, but it could change in another file).

UPDATE:

val delimitersList = List(",", ";", ":", "\\|", "\\t", " ")

def findCommonDelimiter(line: String, sep: Option[String], typeToCheck: String): (List[String], String) = {
  val delimiterMap = scala.collection.mutable.LinkedHashMap[String, Int]()// this needs to be changed to find how many times a delimiter is repeated between two columns
  for (a <- delimitersList)
    delimiterMap += a -> (a + "+").r.findAllIn(line).length

  try {
    val sortedMap = (delimiterMap.toList sortWith ((x, y) => x._2 > y._2)).take(3)
    var splitChar = ""
    val firstDelimiter = sortedMap.head._1.toString
    val firstDelimiterCount = sortedMap.head._2
    val secondDelimiter = sortedMap.drop(1).head._1.toString
    val secondDelimiterCount = sortedMap.drop(1).head._2
    val thirdDelimiter=sortedMap.drop(2).head._1.toString
    val lineSplit=line.split("\\r?\\n")
    if (!firstDelimiter.equalsIgnoreCase(",") &&
       secondDelimiter.equalsIgnoreCase(",") &&
       secondDelimiterCount > 0 &&
       !typeToCheck.equalsIgnoreCase("map")) {//(firstDelimiterCount - commaCount) <= 1 && commaCount > 0) {
      splitChar = ","
    } else if (firstDelimiter.equalsIgnoreCase(" ") || firstDelimiter.equalsIgnoreCase("\\t")) {
      if (lineSplit(0).split(thirdDelimiter, 2).length == 2 &&
         typeToCheck.equalsIgnoreCase("map") &&
         ((secondDelimiter.equalsIgnoreCase(",") &&
         secondDelimiterCount > 0) || (secondDelimiter.equalsIgnoreCase(";") && secondDelimiterCount > 0))) {
        splitChar = thirdDelimiter
      } else if (lineSplit(0).split(secondDelimiter,2).length == 2 && typeToCheck.equalsIgnoreCase("map")) {
        splitChar = secondDelimiter
      } else if (typeToCheck.equalsIgnoreCase("header") && firstDelimiter.equalsIgnoreCase("\\t")) {
        splitChar = "\\t"
      } else if (typeToCheck.equalsIgnoreCase("header") &&
                firstDelimiter.equalsIgnoreCase(" ") &&
                secondDelimiterCount > 0) {
        if ((firstDelimiterCount- secondDelimiterCount >= firstDelimiterCount / 2))
          splitChar = secondDelimiter
      } else {
        if (firstDelimiter.equalsIgnoreCase(" ") &&
           secondDelimiterCount > 0 &&
           (firstDelimiterCount - secondDelimiterCount >= firstDelimiterCount / 2))
          splitChar = secondDelimiter
        else
          splitChar = (sortedMap.maxBy(_._2)._1).toString //.take(1)
      }
    } else
      splitChar = (sortedMap.maxBy(_._2)._1).toString //.take(1)

    if (!splitChar.equalsIgnoreCase("""\|""") && !splitChar.equalsIgnoreCase("\\t")) {
      // println("===>"+splitChar)
      // if(!splitChar.equalsIgnoreCase(""))
      (line.split(splitChar, -1).toList, splitChar)
    } else {
      if (splitChar.equalsIgnoreCase("""\|"""))
        (line.split("\\|", -1).toList, splitChar)
      else
        (line.split("\\t", -1).toList, splitChar)
    }
  } catch {
    case e: Exception => {
      e.printStackTrace()
      (List(line), "")
    }
  }
}

Thanks

Jeffrey Chung
sam
  • Try `.split("""\s{2,}""")` to split with 2 or more whitespaces. Please show your code to repro the issue. What is the expected result for the provided input, BTW? – Wiktor Stribiżew May 16 '17 at 09:26
  • Look, jan0sch posted an unnecessarily complicated version of my solution. Is it working or not? **What do you need in the end**? – Wiktor Stribiżew May 16 '17 at 09:44
  • Only the space count between header columns, as other files can have one space between columns. I can do it with a loop, but was hoping for a Scala-ish answer. – sam May 16 '17 at 09:49
  • But what is your attempt then? You have not posted the code, thus, there is no issue in your question. Do you mean there can even be 1 whitespace between columns? If there can only be two or more, `.split("""\s{2,}""")` should work without having to count the spaces. – Wiktor Stribiżew May 16 '17 at 09:50
  • In other files there is a single space separating the columns, but in that case the values under the header will not contain spaces, so a single space will be the column delimiter. – sam May 16 '17 at 10:01

4 Answers

1

You can use \\s+ to split on runs of whitespace, together with the limit parameter to cap the number of resulting fields, like:

scala> "jOHN     1000     married to Emma".split("\\s+", 3)
res5: Array[String] = Array(jOHN, 1000, married to Emma)
chengpohi
  • It will split `married to Emma` value, too. Your suggestion does not differ from what OP is already using. – Wiktor Stribiżew May 16 '17 at 09:27
  • Thanks for the quick reply, but I need a way to get the space count between columns. – sam May 16 '17 at 09:29
  • @WiktorStribiżew and @sam, sorry, I have updated my answer by using `split` with `limit` param to limit the `split` results size. – chengpohi May 16 '17 at 09:39
  • So, you assume a user name cannot have spaces. Order ID should not have spaces, I guess. @sam, is that the case? – Wiktor Stribiżew May 16 '17 at 09:41
  • @chengpohi: Another file might contain a different number of columns so I can't hard code the number of splits. so if the file contains an address "jOHN 1000 231 any street married to Emma".split("\\s+", 3) res2: Array[String] = Array(jOHN, 1000, 231 any street married to Emma) – sam May 16 '17 at 09:44
  • @WiktorStribiżew: No, that isn't the case. I'm writing something generic, but it does depend on the spaces between header columns, which might vary from one file to another – sam May 16 '17 at 09:46
  • @sam: Please check my comments to the question. We could all spare this meaningless conversation if you posted real code in the first place and stated exact output you seek. – Wiktor Stribiżew May 16 '17 at 09:47
1

I've hacked together some code to get your spaces. Kind of long-winded, but it works.

Borrowing @Kevin Wright's split function:

def split[T](list: List[T]) : List[List[T]] = list match {
  case Nil => Nil
  case h::t => val segment = list takeWhile {h ==}
    segment :: split(list drop segment.length)
}

You can go:

scala> val line = "JOHN     1000     married to Emma"
line: String = JOHN     1000     married to Emma

scala> val lengthOfSpaces = split(line.toCharArray.toList).
     | filter(x => x.head.equals(' ') && x.size > 1).
     | map(y => y.length).
     | distinct.head
lengthOfSpaces: Int = 5

scala> line.split(" " * lengthOfSpaces)
res39: Array[String] = Array(JOHN, 1000, married to Emma)

Will also work if you have extra columns:

scala> val line2 = "jOHN     1000     231 any street     married to Emma"
line2: String = jOHN     1000     231 any street     married to Emma

scala> line2.split(" " * lengthOfSpaces)
res47: Array[String] = Array(jOHN, 1000, 231 any street, married to Emma)

I've made the assumption that the number of spaces between columns is uniform within EACH line. So you can't have 5 spaces between user_name and order_id, and 4 spaces between order_id and the next column.

Also, if you're going to have data with the same number of spaces between columns and between words, perhaps you should normalise your data first. @jan0sch had earlier suggested using tabs as the column separator.
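The uniform-gap assumption above can be checked rather than taken on faith. Below is a minimal sketch, not part of the original answer: the object `GapWidth` and its methods are hypothetical names, and it assumes header names themselves contain no spaces. It reads the run lengths from the header line with a regex and only returns a width when every gap is the same size.

```scala
object GapWidth {
  // Lengths of every run of spaces in the (trimmed) header line.
  def spaceRuns(header: String): List[Int] =
    " +".r.findAllIn(header.trim).map(_.length).toList

  // Some(n) only when every gap in the header is exactly n spaces wide;
  // None when there are no gaps or the gaps differ in width.
  def uniformGap(header: String): Option[Int] =
    spaceRuns(header).distinct match {
      case n :: Nil => Some(n)
      case _        => None
    }

  // Split a data row on exactly n consecutive spaces.
  def splitRow(row: String, n: Int): List[String] =
    row.split(s" {$n}").toList
}
```

For the header in the question, `GapWidth.uniformGap` should give `Some(5)`, and that width can then be passed to `splitRow` for each data line. A `None` result is the signal that the file needs normalising first.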

airudah
0

I've edited the solution to reflect your actual problem. This is somewhat overkill but will solve it. First we analyse the header line to calculate the spaces. For that we assume that you know the number of columns. Then the rest is simply building the appropriate split parameter.

@ val h = "user_name     order_id     M_Status" 
h: String = "user_name     order_id     M_Status"
@ val c = h.count(_ == ' ') / (3 - 1) // total spaces divided by the gaps between the 3 known columns
c: Int = 5
@ " jOHN     1000     married to Emma".split(s" {$c}") 
res18: Array[String] = Array(" jOHN", "1000", "married to Emma")

Better would be to also calculate the number of columns though...
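One way to calculate the number of columns as well is to count the header names and divide the total number of spaces by the number of gaps. This is only a sketch: `HeaderGap.analyse` is a hypothetical helper, and it assumes header names contain no spaces and that every gap has the same width.

```scala
object HeaderGap {
  // Returns (columnCount, gapWidth) for a header line.
  // Assumes header names contain no spaces and all gaps are equally wide.
  def analyse(header: String): (Int, Int) = {
    val h       = header.trim
    val columns = h.split("\\s+").length
    val gaps    = columns - 1
    // Guard against a single-column header, which has no gaps at all.
    val width   = if (gaps > 0) h.count(_ == ' ') / gaps else 0
    (columns, width)
  }
}
```

With the header from the question this should yield 3 columns and a width of 5, which can then be used to build the split pattern as `s" {$width}"`.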

jan0sch
0

You could just use a regex to split the line wherever there is more than one space. No need to count them.

scala> "jOHN Doe     1000     married to Emma".split("""[\s]{2,}""")
res1: Array[String] = Array(jOHN Doe, 1000, married to Emma)
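The comments on the question mention files where a single space separates the columns and the values contain no spaces. A possible way to cover both cases is to let the header decide which pattern to use. This is a sketch under that assumption, and `ColumnSplitter` with its methods are hypothetical names:

```scala
object ColumnSplitter {
  // If the header contains any run of 2+ whitespace characters, treat such runs
  // as the column delimiter; otherwise fall back to any run of whitespace.
  def delimiterFor(header: String): String =
    if ("""\s{2,}""".r.findFirstIn(header.trim).isDefined) """\s{2,}"""
    else """\s+"""

  // Split a data row with the delimiter inferred from the header line.
  def splitWithHeader(header: String, row: String): List[String] =
    row.trim.split(delimiterFor(header)).toList
}
```

This keeps `married to Emma` intact when the header uses wide gaps, and still splits single-space files field by field. It cannot help if a file mixes single-space delimiters with values that contain spaces.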
kikulikov