
I have a CSV file with almost 300k records. It looks like this:

Id, Name, City
1   Alex New York
2   Alex Florida
3   John New York
...
300k Alex New York

Well, I need to parse it and create two separate files with the 500 most frequent values for each of two columns (Name and City). For example, if among the 300k records there are 20k Alex (name), 1k John (name), and 25k Fred (name), I want a separate CSV file with the 500 most frequent names in descending order:

Id, Name
1. Fred
2. Alex
3. John

The same goes for the cities. What is the best way to do this?

  • You have to use a map of values, keep updating the map with the count of each occurrence, and then sort that map based on the value using a comparator. – Rahul Singh May 25 '17 at 11:01
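
For reference, a minimal plain-Java sketch of that map-and-sort approach might look like the following (the file names are examples, Name is assumed to be the second comma-separated column, and the naive split does not handle quoted commas):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class TopNames {
    public static void main(String[] args) throws IOException {
        Map<String, Long> counts = new HashMap<>();

        // Build a map of name -> occurrence count (column index 1 is assumed to be Name)
        try (Stream<String> lines = Files.lines(Paths.get("input.csv"))) {
            lines.skip(1)                                  // skip the header row
                 .map(line -> line.split(",")[1].trim())   // naive split, no quoted fields
                 .forEach(name -> counts.merge(name, 1L, Long::sum));
        }

        // Sort the entries by count, descending, and keep the 500 most frequent names
        String csv = counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
                .limit(500)
                .map(e -> e.getKey() + "," + e.getValue())
                .collect(Collectors.joining(System.lineSeparator(),
                        "Name,Count" + System.lineSeparator(), ""));

        Files.write(Paths.get("top_names.csv"), csv.getBytes());
    }
}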

1 Answer


It seems like a job for Apache Spark. Maybe it is a bit of an overkill to use such a heavy framework, but it offers an elegant and scalable solution for processing big data (even for relatively small data sets like 300k records).

So here is one solution:

// Start a local Spark session
SparkSession spark = SparkSession.builder()
        .master("local")
        .appName("CSV Processor")
        .getOrCreate();

// Read the CSV file, using the first line as the header
Dataset<Row> df = spark.read()
        .option("header", "true")
        .csv("input.csv");
df.printSchema();

// Count the occurrences of each name, sort descending, and keep the top 500
Dataset<Row> grouped = df.groupBy("Name")
        .count()
        .sort(org.apache.spark.sql.functions.col("count").desc())
        .limit(500);
grouped.show();

// Write the result as a single part file inside the "output.csv" directory
grouped.coalesce(1)
        .write()
        .mode("overwrite")
        .option("header", "true")
        .csv("output.csv");

Dependencies:

org.apache.spark:spark-core_2.10:2.1.1
org.apache.spark:spark-sql_2.10:2.1.1

Windows caveat: set up winutils.exe
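
A commonly used workaround is to download winutils.exe into a Hadoop home's bin folder and point Spark at that directory before creating the SparkSession (the path below is only an example):

// Assumes winutils.exe has been placed in C:\hadoop\bin (adjust the path as needed)
System.setProperty("hadoop.home.dir", "C:\\hadoop");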

pufface