2

Scenario: I have a list of files in a folder (including the file paths). I am trying to get the content of each of those files into a dataframe (one for each file), then further perform some operations and later merge these dataframes.

From various other questions in SO, I found multiple ways to iterate over the files in a folder and get the data, but all of those I found usually ready the files in a loop and concatenate them into a single dataframe automatically, which does not work for me.

For example:

import os
import pandas as pd
path = os.getcwd()
files = os.listdir(path)
files_xls = [f for f in files if f[-3:] == 'xls*']
df = pd.DataFrame()
for f in files_xls:
    data = pd.read_excel(f, 'Sheet1')
    df = df.append(data)

or

import pandas as pd
import glob
all_data = pd.DataFrame()
for f in glob.glob("*.xls*"):
    df = pd.read_excel(f)
    all_data = all_data.append(df,ignore_index=True)

The only piece of code I could put together from what I found is:

from os.path import isfile, join
import glob
mypath = "/DGMS/Destop/uploaded"
listoffiles = glob.glob(os.path.join(mypath, "*.xls*"))
contentdataframes = (pd.read_excel(f) for f in listoffiles)

This lines run without error, but they appear not to do anything, no variables or created nor changed.

Question: What am I doing wrong here? Is there a better way to do this?

dorvak
  • 9,219
  • 4
  • 34
  • 43
DGMS89
  • 1,507
  • 6
  • 29
  • 60
  • Did you test your globing-statements, i.e. does the listofffiles contain any filepath or will it stay empty? You might simply add an `print(listoffiles)` statement to inspect the list. – dorvak Jul 09 '18 at 09:53
  • @dorvak the listoffiles contains all the file names properly. If I get one of the entries and assign it to a dataframe manually, it works perfectly. The problem is, I am not being able to do this is a loop for all of them. – DGMS89 Jul 09 '18 at 09:56

1 Answers1

2

You are really close, need join all data by concat from generator:

contentdataframes = (pd.read_excel(f) for f in listoffiles)
df = pd.concat(contentdataframes, ignore_index=True)

If need list of DataFrames:

contentdataframes = [pd.read_excel(f) for f in listoffiles]
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • That is the problem, I don't want to concatenate the contents at this point, I want to transform each item into the list of dataframes into a single dataframe, so I can edit them. – DGMS89 Jul 09 '18 at 10:00
  • So `contentdataframes = [pd.read_excel(f) for f in listoffiles]` should work – jezrael Jul 09 '18 at 10:01
  • 1
    @DGMS89 Your code created a generator object, so that`s probably why it didn`t work ;) – dorvak Jul 09 '18 at 10:03
  • @dorvak I nevr know when to properly use () and []. – DGMS89 Jul 09 '18 at 10:05
  • @jezrael Just one question, how could I split each member of this list into a separate dataframe? (so my editing is simplified) – DGMS89 Jul 09 '18 at 10:06
  • @DGMS89 - Because `pd.read_excel(f)` return DataFrame, you get list of DataFrames. – jezrael Jul 09 '18 at 10:07
  • @DGMS89 You might read [this](https://stackoverflow.com/questions/4407873/whats-the-difference-between-vs-vs) as a simple overview. However, in your example, `()` is not interpreted as an tuple assignment like `(1,2,3)`, but as a generator expression, which is somewhat related to a list comprehension `[i for i in "test"]` , see this [question](https://stackoverflow.com/questions/364802/how-exactly-does-a-generator-comprehension-work) – dorvak Jul 10 '18 at 09:09