
I have a folder on my computer where several files are saved. How can I automatically load only the biggest file (in terms of size in KB)?

Right now I could use:

# Sort the folder in Windows so the biggest file is on top, then:
import pandas as pd
df = pd.read_csv(r'C:\...\FileABC.csv')  # when I know FileABC is listed at the top

Is there a way to do that automatically in Python? Then I could skip the manual sorting in Windows.

PV8
    how do you define `big` here? number of rows, number of columns, size occupied by the file?? – tidakdiinginkan May 04 '20 at 08:12
  • in terms of MB size – PV8 May 04 '20 at 08:12
    Check this [link](https://stackoverflow.com/questions/6591931/getting-file-size-in-python) - you can obtain file size using the `os` module. `os.stat('filename').st_size` should give you the file size in bytes. `os.listdir('dirname')` should give you a list of all files within a given directory `dirname` – tidakdiinginkan May 04 '20 at 08:14
    Hint: `os.stat` gives the size of a file in its `st_size` member. – Serge Ballesta May 04 '20 at 08:16
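
Putting the two hints from the comments together, a minimal sketch (the folder path is a placeholder, and it assumes the directory contains only regular files):

import os
import pandas as pd

folder = r'C:\some\folder'  # placeholder: your directory

# os.listdir() returns bare names; join them with the folder,
# then let max() pick the entry with the largest size in bytes.
paths = [os.path.join(folder, name) for name in os.listdir(folder)]
largest = max(paths, key=os.path.getsize)

df = pd.read_csv(largest)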

3 Answers


Try this:

import os
import pandas as pd

basedir = 'C:/Users/viupadhy/Desktop/Stackoverflow'
names = os.listdir(basedir)
# Build full paths and pair each one with its size in bytes.
paths = [os.path.join(basedir, name) for name in names]
sizes = [(path, os.stat(path).st_size) for path in paths]
# max() over the (path, size) pairs, keyed on the size.
file = max(sizes, key=lambda x: x[1])
print(file)

df = pd.read_csv(file[0])  # file[0] holds the path of the biggest file
df

Output: a screenshot of the printed `(path, size)` tuple and the resulting DataFrame (image not preserved).
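
As an aside, the same lookup can be written with `pathlib` (a sketch, not part of the original answer; it assumes the same folder):

from pathlib import Path

import pandas as pd

base = Path('C:/Users/viupadhy/Desktop/Stackoverflow')
# Keep regular files only and key max() on the size reported by stat().
largest = max((p for p in base.iterdir() if p.is_file()),
              key=lambda p: p.stat().st_size)
df = pd.read_csv(largest)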

Vishal Upadhyay

A simple way to do this:

import os


def find_largest_file(path):
    largest = None
    max_size = 0
    for filename in os.listdir(path):
        full_path = os.path.join(path, filename)  # listdir() returns bare names, so join with the directory
        if os.path.isfile(full_path):
            size = os.path.getsize(full_path)
            if size > max_size:
                largest = filename
                max_size = size
    return largest


print(find_largest_file(path))
# ... whatever largest file you have in `path`.

This can be further improved by filtering for just the `.csv` extension and the like.
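
A minimal sketch of that filtering (the extension check is the only change to the function above):

import os


def find_largest_csv(path):
    largest = None
    max_size = 0
    for filename in os.listdir(path):
        full_path = os.path.join(path, filename)
        # Skip directories and anything that is not a .csv file.
        if os.path.isfile(full_path) and filename.lower().endswith('.csv'):
            size = os.path.getsize(full_path)
            if size > max_size:
                largest = filename
                max_size = size
    return largest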

norok2

You can use something like:

import os
import pandas as pd


folder_path = "C:\\programs\\"

file_list = os.listdir(folder_path)
# Start by assuming the first directory entry is the biggest.
biggest_file = os.path.join(folder_path, file_list[0])

for file in file_list:
    file_location = os.path.join(folder_path, file)
    size = os.path.getsize(file_location)

    if size > os.path.getsize(biggest_file):
        biggest_file = file_location

df = pd.read_csv(biggest_file)
  • Wouldn't it be more efficient to keep track of the file size along the way? This would spare you an extra `os.path.getsize()` at each iteration – norok2 May 04 '20 at 09:38
  • Yes, this can be improved; I just wrote a quick solution and it worked when I tried it. Thank you for your advice. – Bahadır Çetin May 04 '20 at 11:08
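
Following up on that exchange, a sketch (same placeholder folder) that remembers the running maximum size, so `os.path.getsize()` is called only once per file:

import os
import pandas as pd

folder_path = "C:\\programs\\"

biggest_file = None
biggest_size = -1  # track the size so getsize() runs once per file

for file in os.listdir(folder_path):
    file_location = os.path.join(folder_path, file)
    size = os.path.getsize(file_location)
    if size > biggest_size:
        biggest_file = file_location
        biggest_size = size

df = pd.read_csv(biggest_file)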