1

I am merging 3700 csv files with a total of 10 million rows. The files do not have a sequential naming but the date in which they were created(Descending) is sequential. I use the following code to merge them but do not know how to add pick them in that sequence.

import pandas as pd
import glob

path = r'C:\Users\User\' #  path

all_files = glob.glob(path + "/*.csv")

li = []

for filename in all_files:

    df = pd.read_csv(filename, index_col=None, header=0)

    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)

The following are names of files arranged in date modified(oldest to newest) from the os

V100(1005).csv V100(1).csv V100(778).csv V100(2).csv

The file names are not sequential because while downloading other files were deleted in between.

belwood
  • 3,320
  • 11
  • 38
  • 45
Moshe
  • 107
  • 1
  • 9
  • I believe `glob` uses `os`'s default listing, so order is not guaranteed. You probably need to sort `all_files` by manually. – Quang Hoang Oct 18 '20 at 21:54
  • If you want to sort the files by their name, please provide an example of a file name. – EliadL Oct 18 '20 at 22:00
  • @EliadL I do not need to sort the files by their name but date created(modified) but I have supplied the names just for good measure – Moshe Oct 18 '20 at 22:07
  • My bad, I thought the date was incorporated in the file's name. – EliadL Oct 18 '20 at 22:35

1 Answers1

2

If I understand the question correctly, you need this before the loop (based on an answer here), which sorts list of files by creation/modification date in Python 3:

import os
all_files = sorted(all_files, key=os.path.getmtime)
#all_files = sorted(all_files, key=os.path.getctime) #works too

Test on MacOs:

  • I created 5 files similar to OP: enter image description here

  • Using glob to read files

import glob
all_files = glob.glob('V*.csv')
print(all_files)
#['V100-778.csv', 'V100-778-.csv', 'V100-1005-1.csv', 'V100-1.csv', 'V100-2.csv']
  • Sorting the files based on modification time:
import os
all_files = sorted(all_files, key=os.path.getmtime)
print(all_files)
#['V100-778-.csv', 'V100-1005-1.csv', 'V100-1.csv', 'V100-778.csv', 'V100-2.csv']
Grayrigel
  • 3,474
  • 5
  • 14
  • 32
  • Still new to coding. Thank you for your update. The original one did not work since Im using Windows. The updated one worked. Thank you. – Moshe Oct 18 '20 at 22:48
  • Glad it works for you. Please checkmark/accept the answer if it helps. – Grayrigel Oct 18 '20 at 22:49
  • Thanks for the accept. Just FYI, based on my testing, the solution which works for you, worked for MacOS well. – Grayrigel Oct 18 '20 at 23:19
  • Thanks for the infor. Just a quick question though, why cant I read moved csv files using pandas? For example, suppose I download a file, V100.csv and read it without moving it with `df = pd.read_csv('V100.csv')` Then move it to a new folder along with the document such that `df = pd.read_csv('V100.csv')` ? – Moshe Oct 18 '20 at 23:28
  • IIIUC, it is because the path of the files is changed. In that case, you should do `pd.read_csv('Newpath/V100.csv')`. Checkout [this](https://stackoverflow.com/questions/2953834/windows-path-in-python) answer for more information about adding a path. – Grayrigel Oct 18 '20 at 23:36
  • I understand that one very well but it continues to give me this error: `EmptyDataError: No columns to parse from file` – Moshe Oct 19 '20 at 00:24
  • Could be some formatting issues in the CSV files does this help? : https://stackoverflow.com/questions/40193452/importing-text-file-no-columns-to-parse-from-file/40193853 – Grayrigel Oct 19 '20 at 00:31
  • 1
    Yes Thank you.. `pandas.read_csv(filepath_or_buffer, delim_whitespace=True)` Worked. You truly know your stuff – Moshe Oct 19 '20 at 02:13