
I'm trying to read a list of files into a list of Pandas DataFrames in Python. However, the code below doesn't work.

files = [file1, file2, file3]

df1 = pd.DataFrame()
df2 = pd.DataFrame()
df3 = pd.DataFrame()

dfs = [df1, df2, df3]

# Read in data files
for file, df in zip(files, dfs):
    if file_exists(file):
        with open(file, 'rb') as in_file:
            df = pd.read_csv(in_file, low_memory=False)
            print(df)    # the file is getting read properly

print(df1)    # empty
print(df2)    # empty
print(df3)    # empty

How do I get the original DataFrames to update when I pass them into a for-loop as a list of DataFrames?

Aliz Rao
  • What is your goal: to read those files into three DFs, or to merge them together into a single DF? – MaxU - stand with Ukraine Mar 16 '16 at 20:53
  • You're updating the loop variable rather than the list element it refers to; it would be the same if you iterated over any list. Is there a reason you need to construct the empty dfs upfront, rather than just set `dfs=[]` and then iterate over the files and do `dfs.append(pd.read_csv(in_file))`? – EdChum Mar 16 '16 at 20:54
  • When you iterate over a list, you can't modify the element through the loop variable. The line `df = pd.read_csv(in_file, low_memory=False)` is not actually modifying the elements in the list; it rebinds the name `df` to a new object. EDIT: beat me to it @EdChum :) – Alfredo Gimenez Mar 16 '16 at 20:54
  • See http://stackoverflow.com/questions/1207406/remove-items-from-a-list-while-iterating-in-python for an explanation, but essentially you can skip the empty DataFrames and just append your created dfs to a list – EdChum Mar 16 '16 at 20:56
  • @MaxU: I was trying to read files into three different DFs. The idea was that by using a loop, one could concisely import any number of files. – Aliz Rao Mar 21 '16 at 21:23
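To see why the loop in the question leaves `df1`–`df3` empty, here is a minimal sketch (with made-up names) contrasting rebinding the loop variable with assigning through an index:

```python
# Rebinding the loop variable only re-points a local name at a new
# object; the list's own slots are never touched.
items = [0, 0, 0]
for x in items:
    x = 99          # rebinds the name `x`; items is unchanged
print(items)        # -> [0, 0, 0]

# Assigning through the index updates the list itself:
for i in range(len(items)):
    items[i] = 99
print(items)        # -> [99, 99, 99]
```

Assigning through an index (or appending to a fresh list) is what the answers below do.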

6 Answers


Try this:

dfs = [pd.read_csv(f, low_memory=False) for f in files]

If you want to check whether a file exists:

import os

dfs = [pd.read_csv(f, low_memory=False) for f in files if os.path.isfile(f)]

And if you want to concatenate all of them into a single DataFrame:

df = pd.concat([pd.read_csv(f, low_memory=False)
                for f in files if os.path.isfile(f)],
               ignore_index=True)
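As a self-contained sketch of the concat approach, using `io.StringIO` objects with made-up contents in place of real files:

```python
import io
import pandas as pd

# in-memory stand-ins for the CSV files
csv_a = io.StringIO("x,y\n1,2\n3,4\n")
csv_b = io.StringIO("x,y\n5,6\n")

# read each file-like object and stack the rows, renumbering the index
df = pd.concat([pd.read_csv(f) for f in (csv_a, csv_b)],
               ignore_index=True)
print(df)  # 3 rows, columns x and y
```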
MaxU - stand with Ukraine

When you iterate over the list, the loop variable is just a name bound to each element in turn; rebinding it inside the loop does not operate on the list itself.

You need to assign the elements into the list by index (or append them). One possibility:

files = [file1, file2, file3]

dfs = [None] * 3 # Just a placeholder

# Read in data files
for i, file in enumerate(files): # Enumeration instead of zip
    if file_exists(file):
        with open(file, 'rb') as in_file:
            dfs[i] = pd.read_csv(in_file, low_memory=False) # Setting the list element
            print(dfs[i])     # the file is getting read properly

This updates the list elements and should work.

MSeifert

Your code seems overcomplicated; you can just do:

files = [file1, file2, file3]

dfs = []

# Read in data files
for file in files:
    if file_exists(file):
        dfs.append(pd.read_csv(file, low_memory=False))

You will end up with a list of dfs, as desired.

EdChum

You can try list comprehension:

files = [file1, file2, file3]

dfs = [pd.read_csv(x, low_memory=False) for x in files if file_exists(x)]
jezrael

Here is a custom-written Python function that handles both CSV and JSON files:

def generate_list_of_dfs(incoming_files):
    """
    Accepts a list of csv and json file/path names.
    Returns a list of DataFrames.
    """
    outgoing_files = []
    for filename in incoming_files:
        file_extension = filename.split('.')[-1]  # last component, so dotted paths still work
        if file_extension == 'json':
            with open(filename, mode='r') as incoming_file:
                outgoing_json = pd.DataFrame(json.load(incoming_file))
                outgoing_files.append(outgoing_json)
        if file_extension == 'csv':
            outgoing_csv = pd.read_csv(filename)
            outgoing_files.append(outgoing_csv)
    return outgoing_files

How to Call this Function

import pandas as pd
import json    
files_to_be_read = ['filename1.json', 'filename2.csv', 'filename3.json', 'filename4.csv']
dataframes_list = generate_list_of_dfs(files_to_be_read)
Zernach

Here is a simple solution for when you don't need the data frames collected in a list.

import fnmatch
import os

# get the CSV files only
files = fnmatch.filter(os.listdir('.'), '*.csv')
files

The output is now a list of the file names:

['Feedback Form Submissions 1.21-1.25.22.csv',
 'Feedback Form Submissions 1.21.22.csv',
 'Feedback Form Submissions 1.25-1.31.22.csv']

Now create a simple list of new names to make working with them easier:

# use a simple naming format
names = []
for i in range(len(files)):
    names.append('data' + str(i))
names

['data0', 'data1', 'data2']

You can use any list of names that you want. The next step takes the file names and the list of new names and assigns each loaded DataFrame to its name.

# iterate through the files together with the new names
for new_name, file in zip(names, files):
    # load the file into a dataframe
    df = pd.read_csv(file, low_memory=False)
    # assign the dataframe to the variable named by the string
    # (note: writing through locals() is only reliable at module scope)
    locals()[new_name] = df.copy()

You now have 3 separate dataframes named data0, data1, and data2, and can run commands like

data2.info()
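If you'd rather not create variables through `locals()`, a dict keyed by the generated names does the same job. A minimal sketch, with made-up in-memory files standing in for the CSVs:

```python
import io
import pandas as pd

# stand-ins for the CSV files (hypothetical contents)
files = [io.StringIO("a\n1\n"), io.StringIO("a\n2\n"), io.StringIO("a\n3\n")]

# map each generated name to its dataframe
dataframes = {}
for i, f in enumerate(files):
    dataframes['data' + str(i)] = pd.read_csv(f)

print(sorted(dataframes))  # -> ['data0', 'data1', 'data2']
```

Then `dataframes['data2'].info()` works just like `data2.info()` above, without touching the namespace.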
Bryan Butler