0

I imported a local csv file into my spark scala project. I have to calculate some statistics from the data. One of the calculations involve grouping the data by the date field and counting the occurrences by id. The second calculation involves showing the top 100 names in the data that appear frequently.

Any help would be appreciated on these items.

For the first one, I am able to create the "Month" column, but null values are there instead of just the numeric month so that I can group by the new "Month" column. Can someone help with this?

TestData_csv.withColumn("Month",date_format(to_date(col("date")), "MM")).show(false)
Dmytro Mitin
  • 48,194
  • 3
  • 28
  • 66
Kacie
  • 3
  • 4
  • 1
    You questions is not very clear, you should add dimple data for input and desired output – Abdennacer Lachiheb Dec 19 '22 at 23:52
  • Id Date Fname Lname Route 1 1/1/2017 Greg Rollo S 2 1/2/2017 Frankie Allen N 2 1/3/2017 Denise Smith E 2 2/11/2017 Earlene Lane W 2 3/22/2017 Donovan Joseph SE 3 3/11/2017 Rebecca Bowls NE 3 4/5/2017 Katherine Cook NW 4 5/16/2017 Alicia Mason SW 4 6/9/2017 Bob Peterson S 4 6/30/2017 Janet Love N 5 7/2/2017 Richard Dingle E 5 8/25/2017 Thomas Velasquez W 5 8/10/2017 Susan King SE 1 9/25/2017 Pratesh Venkat NE 1 10/14/2017 Neha Kumar NW 1 11/2/2017 Louis Williams SW 3 11/28/2017 Winston McLean E 3 12/2/2017 Clark Kent N 3 12/14/2017 Bruce Wayne S – Kacie Dec 20 '22 at 00:15
  • expected output – Kacie Dec 20 '22 at 00:21
  • Month Number of Routes Jan 3 Feb 1 Mar 2 Apr 1 May 1 Jun 2 Jul 1 Aug 2 Sep 1 Oct 1 Nov 2 Dec 2 @AbdennacerLachiheb – Kacie Dec 20 '22 at 00:21

1 Answers1

0

You were almost there!

The problem in your case was that the to_date function you were using did not understand what format the strings were written in. There is also a to_date function that accepts a fmt string to make Spark understand how to parse your strings.

That would looks something like this:

import spark.implicits._

// This is just the data you posted in a comment on your question
val df = spark.read.option("sep", ";").option("header", "true").csv("./dateTimeCSV.csv")

df.show
+---+----------+---------+---------+-----+                                                                                                                                                                                                                                      
| Id|      Date|    Fname|    Lname|Route|                                                                                                                                                                                                                                      
+---+----------+---------+---------+-----+                                                                                                                                                                                                                                      
|  1|  1/1/2017|     Greg|    Rollo|    S|                                                                                                                                                                                                                                      
|  2|  1/2/2017|  Frankie|    Allen|    N|                                                                                                                                                                                                                                      
|  2|  1/3/2017|   Denise|    Smith|    E|                                                                                                                                                                                                                                      
|  2| 2/11/2017|  Earlene|     Lane|    W|                                                                                                                                                                                                                                      
|  2| 3/22/2017|  Donovan|   Joseph|   SE|                                                                                                                                                                                                                                      
|  3| 3/11/2017|  Rebecca|    Bowls|   NE|                                                                                                                                                                                                                                      
|  3|  4/5/2017|Katherine|     Cook|   NW|                                                                                                                                                                                                                                      
|  4| 5/16/2017|   Alicia|    Mason|   SW|                                                                                                                                                                                                                                      
|  4|  6/9/2017|      Bob| Peterson|    S|                                                                                                                                                                                                                                      
|  4| 6/30/2017|    Janet|     Love|    N|                                                                                                                                                                                                                                      
|  5|  7/2/2017|  Richard|   Dingle|    E|                                                                                                                                                                                                                                      
|  5| 8/25/2017|   Thomas|Velasquez|    W|                                                                                                                                                                                                                                      
|  5| 8/10/2017|    Susan|     King|   SE|                                                                                                                                                                                                                                      
|  1| 9/25/2017|  Pratesh|   Venkat|   NE|                                                                                                                                                                                                                                      
|  1|10/14/2017|     Neha|    Kumar|   NW|                                                                                                                                                                                                                                      
|  1| 11/2/2017|    Louis| Williams|   SW|                                                                                                                                                                                                                                      
|  3|11/28/2017|  Winston|   McLean|    E|                                                                                                                                                                                                                                      
|  3| 12/2/2017|    Clark|     Kent|    N|                                                                                                                                                                                                                                      
|  3|12/14/2017|    Bruce|    Wayne|    S|                                                                                                                                                                                                                                      
+---+----------+---------+---------+-----+

val output = df.withColumn("Month",date_format(to_date(df("Date"), "MM/dd/yyyy"), "MM"))

output.show
+---+----------+---------+---------+-----+-----+                                                                                                                                                                                                                                
| Id|      Date|    Fname|    Lname|Route|Month|                                                                                                                                                                                                                                
+---+----------+---------+---------+-----+-----+                                                                                                                                                                                                                                
|  1|  1/1/2017|     Greg|    Rollo|    S|   01|                                                                                                                                                                                                                                
|  2|  1/2/2017|  Frankie|    Allen|    N|   01|                                                                                                                                                                                                                                
|  2|  1/3/2017|   Denise|    Smith|    E|   01|                                                                                                                                                                                                                                
|  2| 2/11/2017|  Earlene|     Lane|    W|   02|                                                                                                                                                                                                                                
|  2| 3/22/2017|  Donovan|   Joseph|   SE|   03|                                                                                                                                                                                                                                
|  3| 3/11/2017|  Rebecca|    Bowls|   NE|   03|                                                                                                                                                                                                                                
|  3|  4/5/2017|Katherine|     Cook|   NW|   04|                                                                                                                                                                                                                                
|  4| 5/16/2017|   Alicia|    Mason|   SW|   05|                                                                                                                                                                                                                                
|  4|  6/9/2017|      Bob| Peterson|    S|   06|                                                                                                                                                                                                                                
|  4| 6/30/2017|    Janet|     Love|    N|   06|                                                                                                                                                                                                                                
|  5|  7/2/2017|  Richard|   Dingle|    E|   07|                                                                                                                                                                                                                                
|  5| 8/25/2017|   Thomas|Velasquez|    W|   08|                                                                                                                                                                                                                                
|  5| 8/10/2017|    Susan|     King|   SE|   08|                                                                                                                                                                                                                                
|  1| 9/25/2017|  Pratesh|   Venkat|   NE|   09|                                                                                                                                                                                                                                
|  1|10/14/2017|     Neha|    Kumar|   NW|   10|                                                                                                                                                                                                                                
|  1| 11/2/2017|    Louis| Williams|   SW|   11|                                                                                                                                                                                                                                
|  3|11/28/2017|  Winston|   McLean|    E|   11|                                                                                                                                                                                                                                
|  3| 12/2/2017|    Clark|     Kent|    N|   12|                                                                                                                                                                                                                                
|  3|12/14/2017|    Bruce|    Wayne|    S|   12|                                                                                                                                                                                                                                
+---+----------+---------+---------+-----+-----+

As you can see, I added a fmt string with as value "MM/dd/yyyy" to make Spark understand how to parse the date.

Hope this helps!

Koedlt
  • 4,286
  • 8
  • 15
  • 33
  • I tried adding your code and now I am getting this error: ERROR Shell: Failed to locate the winutils binary in the hadoop binary path java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries. – Kacie Dec 20 '22 at 19:04
  • That's a different issue. Try out [this](https://stackoverflow.com/questions/35652665/java-io-ioexception-could-not-locate-executable-null-bin-winutils-exe-in-the-ha) for your winutils issue. – Koedlt Dec 20 '22 at 19:06
  • Thank you!! I would like the result set to show 1 row for each month. I only need to display the distinct month and the count of the routes – Kacie Dec 20 '22 at 19:08
  • That is also a different question :) Here on Stackoverflow, we limit each question post to a single for reasons explained [here](https://meta.stackexchange.com/a/222741/1291568). If you feel your question about creating a numeric month column is answered, don't hesitate to mark this answer as accepted. If you have another question, feel free to make a new SO question :) – Koedlt Dec 20 '22 at 19:12
  • But your question is a very basic one - a google search should suffice (`groupBy` and `count` will do the trick) – Koedlt Dec 20 '22 at 19:13