-1

I have multiple .csv files of various departments, with same header (Name, Full Address, Department ), Full address is a long string around 20 - 50 words , it includes all details including city. Each of my .CSV file contain around 100K rows, and i have around 400 such files in same folder.

What i want to do is that i want to create a separate csv file for all the employees (after reading all csv files) who lives in New York, (the resultant file should contain Name, full address and department) manually it is not possible but i know it is possible in python.

Can anyone help me out with this?

Mohd Naved
  • 358
  • 3
  • 12
  • have you already created a dataframe ? if you need to parse out the city please add a sample of your data please read [mcve] – Umar.H Jun 10 '20 at 15:29
  • I have not created the dataframe or any code yet. m just a beginner. – Mohd Naved Jun 10 '20 at 15:34
  • One possible way: generate list of csv files in directory. Create empty list y. Iterate over list of files. Read files line by line using csv module (https://stackoverflow.com/a/13428432/503835). Check if string "new york" exists in line, if not a duplicate, store in y. After done reading convert to dataframe and save to csv. – eNc Jun 10 '20 at 15:34
  • @eNc I have to consider the every file is of around 400 MB, and there are 400 such files.. so reading all one by one , will it crash my system? – Mohd Naved Jun 10 '20 at 15:36

1 Answers1

1

As suggested, due to the size you probably should just parse all the data a file at a time and write out any matching rows to a new CSV file. This approach does not require whole files to be loaded and so will work with any size.

This can be done as follows:

import csv
import glob

write_header = True
output_csv = 'new york.csv'     # assume this is not already used

with open(output_csv, 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)

    for csv_filename in glob.glob('*.csv'):
        if csv_filename != output_csv:
            with open(csv_filename) as f_input:
                csv_input = csv.reader(f_input)
                header = next(csv_input)

                if write_header:
                    csv_output.writerow(header)
                    write_header = False

                for row in csv_input:
                    if "new york" in row[1].lower():
                        csv_output.writerow(row)

It should produce an output new york.csv file containing any rows where the address column contained the word new york. Obviously you might need to improve this test to cover badly formed addresses.

Martin Evans
  • 45,791
  • 17
  • 81
  • 97