Splitting/reading CSV file by distinct row

Question

I have a csv file with 3 columns.

Key,Branch,Account 
a,213,234567
a,454,457900
a,562,340094
a,200,456704
b,400,850988
b,590,344433
c,565,678635
c,300,453432
c,555,563546
c,001,660905

I would like to iterate through each row, find the distinct values in the Key column (a, b and c), and split the rows into 3 separate PySpark DataFrames, one per key:

   a,213,234567
   a,454,457900
   a,562,340094
   a,200,456704


   b,400,850988
   b,590,344433


   c,565,678635
   c,300,453432
   c,555,563546
   c,001,660905

is the output correct? There are 4 rows for a but output has 3 rows. — Shivam Seth, Feb 28 '20 at 17:35
if you are trying to save the different dataframes as different files in a file system, look at my answer here, https://stackoverflow.com/questions/60048027/how-to-manage-physical-data-placement-of-a-dataframe-across-the-cluster-with-pys/60048672#60048672. python/pandas solutions are not for big data. — murtihash, Feb 28 '20 at 19:47
Alright, what’s the problem? Have you actually tried anything, done any research? — AMC, Feb 28 '20 at 20:47

score 0 · Answer 1 · answered Feb 28 '20 at 17:33

Something like this?

csv_string = """Key,Branch,Account 
a,213,234567
a,454,457900
a,562,340094
a,200,456704
b,400,850988
b,590,344433
c,565,678635
c,300,453432
c,555,563546
c,001,660905"""

import csv
import io

#
# 1. Parse csv_string into a list of dicts, one per row
#    (OrderedDicts on Python versions before 3.8)
#

def parse_csv(string):
    # if you are reading from a file you don't need to do this
    # StringIO nonsense -- just pass the file to csv.DictReader()
    string_file = io.StringIO(string)
    reader = csv.DictReader(string_file)
    return list(reader)

csv_table = parse_csv(csv_string)

#
# 2. Loop through each line of the table and get the key
#  - If we have seen the key before, put the line in the list
#    with other lines that had the same key
#  - If not, start a new list for that key
#

result = {}

for line in csv_table:
    key = line["Key"].strip()
    print(key, ":", line)
    if key in result:
        result[key].append(line)
    else:
        result[key] = [line]

#
# 3. Finally, print the result.
# The lines will probably be easier to deal with if you keep them 
# in their parsed form, but for readability we can join the values
# of the line back into a string with commas
#

print(result)
print("")

for key_list in result.values():
    for line in key_list:
        print(",".join(line.values()))
    print("")
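The grouping loop above can be written more compactly with collections.defaultdict, which creates the empty list the first time a key is seen. This is only a minimal sketch of the same idea: the function name group_rows_by_key and its key_column parameter are made up for illustration, and the sample data is the question's CSV with the trailing space after "Account" removed.

```python
import csv
import io
from collections import defaultdict

# Sample data from the question (trailing space in the header removed).
CSV_TEXT = """Key,Branch,Account
a,213,234567
a,454,457900
a,562,340094
a,200,456704
b,400,850988
b,590,344433
c,565,678635
c,300,453432
c,555,563546
c,001,660905"""


def group_rows_by_key(text, key_column="Key"):
    """Hypothetical helper: return {key: [row_dict, ...]} in first-seen key order."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        # A missing key gets a fresh empty list automatically.
        groups[row[key_column].strip()].append(row)
    return dict(groups)


groups = group_rows_by_key(CSV_TEXT)

for key, rows in groups.items():
    for row in rows:
        print(",".join(row.values()))
    print("")
```

Because the csv module leaves every field as a string, values such as 001 keep their leading zeros.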

score 0 · Answer 2 · answered Feb 28 '20 at 19:42

You can do the same with the pandas library, which also lets you perform further operations with minimal code. See the pandas documentation for details.

Here is the code to get the desired output. I store the data in a dictionary, so you can look up the rows for a key with rows_by_key[key], e.g. rows_by_key["a"].

import pandas

df = pandas.read_csv("data.csv", delimiter=",")

keys = df["Key"].unique()  # All unique keys, in order of first appearance

grouped = df.groupby("Key")  # Group (not sort) the rows by the value of column Key

rows_by_key = {}  # Maps each key to its rows (avoids shadowing the built-in dict)
for key in keys:
    rows_by_key[key] = grouped.get_group(key).values.tolist()

for key in keys:
    print("{} : {}".format(key, rows_by_key[key]))

Output:

a : [['a', 213, 234567], ['a', 454, 457900], ['a', 562, 340094], ['a', 200, 456704]]

b : [['b', 400, 850988], ['b', 590, 344433]]

c : [['c', 565, 678635], ['c', 300, 453432], ['c', 555, 563546], ['c', 1, 660905]]
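Note that the last row for c shows 1 rather than 001: by default pandas parses the Branch column as integers, so leading zeros are lost. If the branch codes are really identifiers, one option is to read every column as text. The sketch below assumes that is acceptable for your data; it also keeps each group as its own DataFrame instead of a list of lists. The names CSV_TEXT and frames are placeholders, and the data is read from a string here only so the example is self-contained.

```python
import io

import pandas as pd

# Sample data from the question (trailing space in the header removed).
CSV_TEXT = """Key,Branch,Account
a,213,234567
a,454,457900
a,562,340094
a,200,456704
b,400,850988
b,590,344433
c,565,678635
c,300,453432
c,555,563546
c,001,660905"""

# dtype=str keeps every column as text, so "001" is not parsed as the integer 1.
df = pd.read_csv(io.StringIO(CSV_TEXT), dtype=str)

# Iterating over a groupby yields (key, sub-DataFrame) pairs.
frames = {key: group.reset_index(drop=True) for key, group in df.groupby("Key")}

for key, frame in frames.items():
    print(key)
    print(frame.to_string(index=False))
    print("")
```

With a real file you would pass the path to pd.read_csv instead of the StringIO object. As one of the comments on the question points out, this pandas approach runs on a single machine and is not a substitute for PySpark on large data.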
