3

I have six .csv files. Their overall size is approximately 4 GB. I need to clean each one and run some data analysis tasks on each; the operations are the same for all the frames. This is my code for reading them:

#df = pd.read_csv(r"yellow_tripdata_2018-01.csv")
#df = pd.read_csv(r"yellow_tripdata_2018-02.csv")
#df = pd.read_csv(r"yellow_tripdata_2018-03.csv")
#df = pd.read_csv(r"yellow_tripdata_2018-04.csv")
#df = pd.read_csv(r"yellow_tripdata_2018-05.csv")
df = pd.read_csv(r"yellow_tripdata_2018-06.csv")

Each time I run the kernel, I activate one of the files to be read. I am looking for a more elegant way to do this. I thought about making a list of the file names and reading them one after the other in a for loop, but I don't want to merge them together, so I think another approach must exist. I have been searching for it, but all the questions I find lead to concatenating the files at the end.

parastoo91

5 Answers

3

Use a for loop with str.format, like this. I use this pattern every single day:

number_of_files = 6

for i in range(1, number_of_files + 1):
    df = pd.read_csv("yellow_tripdata_2018-0{}.csv".format(i))

    # your code here: do the analysis, then the loop comes back around and reads the next dataframe
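If you ever have more than nine files, the hard-coded "-0{}" prefix breaks. A small variation of the same idea (not part of the original answer, just a sketch assuming the files keep the yellow_tripdata_2018-MM.csv naming scheme) zero-pads the number instead:

import pandas as pd

number_of_files = 6

for i in range(1, number_of_files + 1):
    path = "yellow_tripdata_2018-{:02d}.csv".format(i)  # 01, 02, ..., 06
    df = pd.read_csv(path)
    # do your cleaning/analysis here; df is replaced on the next iteration,
    # so only one file is held in memory at a time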
seralouk
3

You could use a list to hold all of the dataframes:

number_of_files = 6
dfs = []

for file_num in range(1, number_of_files + 1):
    dfs.append(pd.read_csv(f"yellow_tripdata_2018-0{file_num}.csv"))  # I use Python 3.6, so I'm used to f-strings now. If you're on Python <3.6, use .format()

Then to get a certain dataframe use:

df1 = dfs[0]

Edit:

As you are trying to keep from loading all of these in memory, I'd resort to streaming them. Try changing the for loop to something like this:

import csv

for file_num in range(1, number_of_files + 1):
    # open in mode 'r' for reading; the handle is left open so the reader can keep streaming from it
    f = open(f"yellow_tripdata_2018-0{file_num}.csv", 'r', newline='')
    dfs.append(csv.reader(f))

Then just use a for loop over dfs[n] or next(dfs[n]) to read each line into memory.
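For example (a sketch, assuming dfs holds the csv.reader objects built above):

reader = dfs[0]
header = next(reader)   # the first call returns the header row
for row in reader:      # then iterate: one row in memory at a time
    pass  # clean/inspect each row here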

P.S.

You may need multi-threading to iterate through each one at the same time.
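Something along these lines could stream all six at once (a rough sketch; process_file is a hypothetical placeholder for your per-file cleaning):

import concurrent.futures
import csv

def process_file(path):
    # hypothetical per-file routine: stream one CSV line by line
    with open(path, 'r', newline='') as f:
        for row in csv.reader(f):
            pass  # clean/inspect each row here

paths = ["yellow_tripdata_2018-0{}.csv".format(i) for i in range(1, 7)]

# one thread per file, so all six are streamed at the same time
with concurrent.futures.ThreadPoolExecutor(max_workers=len(paths)) as executor:
    list(executor.map(process_file, paths))  # list() forces completion and surfaces any exceptions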

Loading/Editing/Saving - using the csv module

OK, so I've done a lot of research: Python's csv module does load one line at a time, as long as we open the file in the right mode (explained here).

If you don't want to use pandas (chunking may honestly be the answer; if so, just fold it into @seralouk's answer), then yes, the approach below is, to my mind, the best one. We just need to change a couple of things.

import csv

number_of_files = 6
filename = "yellow_tripdata_2018-{}.csv"

for file_num in range(1, number_of_files + 1):
    # notice I'm opening the original file as f in mode 'r' for read only
    # and the new file as nf in mode 'a' for append
    with open(filename.format(str(file_num).zfill(2)), 'r', newline='') as f, \
         open(filename.format(str(file_num).zfill(2) + "-new"), 'a', newline='') as nf:
        # initialize the writer before looping over every line
        w = csv.writer(nf)
        for row in csv.reader(f):
            # do your "data cleaning" here (THIS IS PER-LINE, REMEMBER)
            # then save the cleaned row to the new file
            w.writerow(row)

Note:

You may want to consider using a DictReader and/or DictWriter, I'd prefer them over regular reader/writers as I find them easier to understand.
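For example, the same read/clean/write loop with DictReader/DictWriter would look roughly like this (a sketch; the single file name and the fare_amount rule are just illustrations, not from the original answer):

import csv

with open("yellow_tripdata_2018-01.csv", "r", newline="") as f, \
     open("yellow_tripdata_2018-01-new.csv", "w", newline="") as nf:
    reader = csv.DictReader(f)                              # rows come back as dicts keyed by the header
    writer = csv.DictWriter(nf, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        # clean by column name instead of index, e.g. drop negative fares (hypothetical rule)
        if float(row["fare_amount"]) >= 0:
            writer.writerow(row)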

Pandas Approach - using chunks

PLEASE READ this answer if you'd like to steer away from my csv approach and stick with pandas :) It seems like it's literally the same issue as yours, and the answer is what you're asking for.

Basically, pandas lets you load a file partially, as chunks, apply any alterations, and then write those chunks out to a new file. The code below is mostly from that answer, but I did some more reading up myself in the docs.

number_of_files = 6
chunksize = 500 #find the chunksize that works best for you
filename = "yellow_tripdata_2018-{}.csv"

for file_num in range(1, number_of_files + 1):
    for chunk in pd.read_csv(filename.format(str(file_num).zfill(2)), chunksize=chunksize):
        # Do your data cleaning
        chunk.to_csv(filename.format(str(file_num).zfill(2) + "-new"), mode='a')  # see, again we're writing in append mode, so the new file is created chunk by chunk

For more info on chunking the data, see here as well; it's good reading for those, such as yourself, getting headaches over these memory issues.
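One thing to watch with the chunked to_csv(..., mode='a') call above (my own note, not from the linked answer): every chunk re-writes the header row by default. A small tweak writes it only for the first chunk:

import pandas as pd

first_chunk = True
for chunk in pd.read_csv("yellow_tripdata_2018-01.csv", chunksize=500):
    # ... data cleaning on chunk ...
    chunk.to_csv("yellow_tripdata_2018-01-new.csv",
                 mode="a",
                 header=first_chunk,  # write the header only once
                 index=False)
    first_chunk = False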

Jab
  • but I don't want to keep them all in memory at the same time. They accumulate to almost 4 GB. – parastoo91 Nov 01 '18 at 12:03
  • I'd refer to [this](https://stackoverflow.com/a/6556862/225020) answer. What you're looking for then is streaming the loading of the csv file. Although to do all files at the same time you'll need to use multi-threading – Jab Nov 01 '18 at 12:11
  • Edited my answer. Hopefully my code makes sense, please advise if not. GL – Jab Nov 01 '18 at 12:25
  • Thank you very much. I will use this strategy, but is it possible to go with the for loop and formatting, like the other answers have suggested? I just get confused about how working in a notebook goes when I create a for-loop: in the first iteration, I read the first data-frame. Are all the other cells then considered operations inside the loop? – parastoo91 Nov 01 '18 at 12:32
  • For loop and formatting? What do you mean formatting? – Jab Nov 01 '18 at 12:36
  • @parastoo91 see my answer below. No need to load anything in memory – seralouk Nov 01 '18 at 12:45
  • He said the files are 4 GB each. I assume this will not fit into memory for him as it's too large. Streaming should only load one line at a time into memory if I'm not mistaken. – Jab Nov 01 '18 at 13:09
  • What I want to do is this: read the data-frame, do the data cleaning, put the resulting data frame in a new .csv file, and do these operations on the rest of the files I have. I don't want to get a memory error, so I need to read these dataframes one by one. – parastoo91 Nov 01 '18 at 14:08
  • @parastoo91 I edited. I did a crapload of research on it, as I had less experience at this depth than I originally thought I'd need. I believe that last edit should help you! If not, feel free to let me know! – Jab Nov 01 '18 at 15:45
  • Ok, last edit I hope; I included all the pandas stuff I found. Chunksize is what is needed for that! Have a look – Jab Nov 01 '18 at 16:16
2

Use glob.glob to get all files with similar names:

import glob
files = glob.glob("yellow_tripdata_2018-0?.csv")
for f in files:
    df = pd.read_csv(f)
    # manipulate df
    df.to_csv(f)

This will match yellow_tripdata_2018-0<any one character>.csv. You can also use yellow_tripdata_2018-0*.csv to match yellow_tripdata_2018-0<anything>.csv or even yellow_tripdata_*.csv to match all csv files that start with yellow_tripdata.

Note that this also only loads one file at a time.
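One caveat worth adding (not in the answer itself): glob.glob doesn't guarantee any particular order, so if you need the files processed month by month, sort the list first:

import glob

files = sorted(glob.glob("yellow_tripdata_2018-0?.csv"))  # 01, 02, ... in order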

kuropan
  • what does files contain? Is it a list of the names of my files? – parastoo91 Nov 01 '18 at 13:09
  • yes, files contains the list of the files that match the pattern (you could try that by ``print(files)``). In the pattern ``?`` matches any **one** character (including ``?`` itself) and ``*`` matches as many characters as possible. – kuropan Nov 01 '18 at 13:13
0

Use os.listdir() to make a list of files you can loop through?

import os

samplefiles = os.listdir(filepath)
for filename in samplefiles:
    df = pd.read_csv(os.path.join(filepath, filename))  # join the folder path, since listdir returns bare names

where filepath is the directory containing multiple csv's?
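One thing to watch (not in the original answer): os.listdir returns everything in the folder, so if it holds anything besides the six CSVs you may want to filter, e.g.:

import os
import pandas as pd

samplefiles = [f for f in os.listdir(filepath) if f.endswith(".csv")]  # keep only the CSVs
for filename in samplefiles:
    df = pd.read_csv(os.path.join(filepath, filename))
    # process df here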

Or a loop that changes the filename:

for i in range(1, 7):
    df = pd.read_csv(r"yellow_tripdata_2018-0%s.csv" % i)  # format the filename string, not the dataframe
Pallie
  • Thank you. How does this work? I have the first data frame in memory now; then I run all the other cells. After all the cells have executed, does the notebook automatically read the second dataframe in the files folder? – parastoo91 Nov 01 '18 at 12:05
0
# import libraries
import pandas as pd
import glob

# store the project folder path in a variable
project_folder = r"C:\file_path"

# save all file paths in a variable
all_file_paths = glob.glob(project_folder + "/*.csv")

# use a list comprehension to read every file into a list of dataframes
li = [pd.read_csv(filename, index_col=None, header=0) for filename in all_file_paths]

# concatenate the list into a single pandas dataframe
df = pd.concat(li, axis=0, ignore_index=True)