1

I have 20 excel files, each represent a year, each one of them have 10 sheets of different (but related to each other) data for that year.

How to properly import them all in pandas dataframe for purpose of Data analysis for the whole period?

To illustrate more, for example: Should I use a Dict for each excel file (year), where keys are sheetname and values are sheet content (the data frame)? Or What is the proper method in this scenario?

Edit 1: The data is Our Football league information, each season in a separate excel file, that has multiple sheets (sheets data example: clubs, players, matches, goals, cards...etc). I have 20 years worth of data, I’m about to analyze it but i got overwhelmed by different ways of importing them using pandas.

Thanks

Ji---
  • 115
  • 1
  • 10
  • 2
    does each sheet have the same schema? do you want to have a single dataframe as a result? – Paul H Oct 15 '17 at 01:14
  • 1
    Why not load them up separately and concatenate? – Mad Physicist Oct 15 '17 at 01:16
  • @PaulH Each excel file has the same sheets, but each sheet has different columns structure (some columns are common to reference key values). I want the proper method to import all data so i can manipulate and analyze the data. – Ji--- Oct 15 '17 at 01:19
  • @PaulH Im thinking one dataframe. (I also Updated the question with the details of the data) – Ji--- Oct 15 '17 at 01:26

1 Answers1

0

This should work with ExcelFile and concat. Update based on comment:

import pandas as pd

location1 = r'Location1.xlsx'
location2 = r'Location2.xlsx'

locations = [location1, location2]

frames = []

for loc in locations:
    file = pd.ExcelFile(loc)
    df = file.parse('Sheet1')
    df['source'] = loc.rsplit('\\', 1)[-1]
    frames.append(df)

df = pd.concat(frames)

This method would allow you to loop through your locations and add a source column, which would give the file name - I assume that's the year. Note that you might have better luck creating a dataframe for each sheet instead of one massive dataframe for all sheets, as I am guessing the data structure is different for each sheet. The you can join them as needed using merge.

Also, if all your files are in one location, you might have luck loading them all at once with the glob function. For details, see Import multiple csv files into pandas and concatenate into one DataFrame.

kjmerf
  • 4,275
  • 3
  • 21
  • 29
  • Thanks. You mean i would import each sheet as separate dataframe (200 total sheets), then join them? How do I preserve what excelfile(year) these data are from? – Ji--- Oct 15 '17 at 01:30