The general use case behind the question is to read multiple CSV log files from a target directory into a single pandas DataFrame for quick-turnaround statistical analysis and charting. The idea behind using pandas rather than MySQL is to run this import (or append) plus the statistical analysis periodically throughout the day.
The script below attempts to read all of the CSV files (same file layout) into a single pandas DataFrame and adds a year column derived from each file read.
The problem is that the script reads only the very last file in the directory rather than all of the files in the targeted directory.
# Assemble all of the data files into a single DataFrame and add a year field
# 2010 is the last available year
import pandas as pd

columns = ['name', 'sex', 'births']   # per the DataFrame summary below
pieces = []
years = range(1880, 2011)
for year in years:
    path = 'C:\\Documents and Settings\\Foo\\My Documents\\pydata-book\\pydata-book-master\\ch02\\names\\yob%d.txt' % year
frame = pd.read_csv(path, names=columns)
frame['year'] = year
pieces.append(frame)
# Concatenate everything into a single DataFrame
names = pd.concat(pieces, ignore_index=True)
# Expected row total should be 1690784
names
<class 'pandas.core.frame.DataFrame'>
Int64Index: 33838 entries, 0 to 33837
Data columns:
name 33838 non-null values
sex 33838 non-null values
births 33838 non-null values
year 33838 non-null values
dtypes: int64(2), object(2)
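The 33838-row total above matches a single year's file, which points at the loop: as written, only the path assignment is indented under the for statement, so read_csv and append run once, after path has already been set to the 2010 file. A minimal sketch with the body fully indented, reusing the same path and columns as above:

pieces = []
for year in range(1880, 2011):
    path = 'C:\\Documents and Settings\\Foo\\My Documents\\pydata-book\\pydata-book-master\\ch02\\names\\yob%d.txt' % year
    # Read one year's file and tag every row with its year
    frame = pd.read_csv(path, names=columns)
    frame['year'] = year
    pieces.append(frame)   # runs once per year, 131 times in total
names = pd.concat(pieces, ignore_index=True)   # 1690784 rows expected

For the general log-directory use case, the explicit year range can be replaced with a file listing, e.g. looping over sorted(glob.glob('C:\\some\\log\\dir\\*.csv')) after import glob (the directory here is a placeholder).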
# Start aggregating the data at the year and sex level using groupby or a pivot table
total_births = names.pivot_table('births', rows='year', cols='sex', aggfunc=sum)
# Display the last rows of the pivot table
total_births.tail()
Out[35]:
sex F M
year
2010 1759010 1898382
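Since the comment above mentions groupby as an alternative, this sketch produces the same year-by-sex table; note that current pandas spells the pivot_table keywords index= and columns= rather than rows= and cols=:

# Sum births for each (year, sex) pair, then pivot sex into columns
total_births = names.groupby(['year', 'sex'])['births'].sum().unstack('sex')
total_births.tail()

Once all 131 files are loaded, tail() should show the last five years (2006 through 2010) rather than the single 2010 row above.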