0

For Part 1, I have multiple csv files which I loop through to create new csv files with just summary statistics (medians). The new csv files have the original filename + 'summary_' at the start. This part is okay.

For Part 2, I want to concatenate all of the 'summary_' files (they have the same column names as each other), but have the row names in the concatenated dataframe the same as the name of the respective 'summary_' csv file where the data comes from.

With stackoverflow's help, I have solved Part 1, but not Part 2 yet. I can concatenate all of the csv files, but not just the ones with 'summary_' in the name (i.e. the new csv's created in Part 1), and not with the correct row names...


import os
import pandas as pd
import glob

## Part 1

summary_stats = ['median']

filenames = (filename for filename in os.listdir(os.curdir) if os.path.splitext(filename)[1] == '.csv')

for filename in filenames:
    df = pd.read_csv(filename, )

    summary_df = df.agg(summary_stats)
    summary_df.to_csv(f'summary_{filename}')

## Part 2

path = r'/Users/Desktop/Practice code'
all_files = glob.glob(path + "/*.csv")

list = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    list.append(df)

frame = pd.concat(list, axis=0, ignore_index=True)

2 Answers2

0
  • Please make sure that the all_files is only loading the files matching "summary_*.csv"

  • Then, you can append rows from one dataframe to another using df.append()

So your code might look something like this

path = r'/Users/Desktop/Practice code'
all_files = glob.glob(path + "/summary_*.csv")

summary_df = None

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    df['row'] = filename.split('summary_')[1].split('.csv')[0]
    df.set_index('row')

    if summary_df is None:
        summary_df = df
    else:
        summary_df = summary_df.append(df)
Ankur
  • 31
  • 2
  • Thanks! That has worked really well. When I print "all_files" the following appears: ['/Users/Desktop/Practice code/summary_Case_1.csv', '/Users/Desktop/Practice code/summary_Case_2.csv', '/Users/Desktop/Practice code/summary_Case_3.csv'] How do I assign the names of those files (e.g. "summary_Case_1") as the respective row names in the concatenated dataframe? – noobpython Mar 27 '19 at 22:29
  • Please see the edited code, i havent tested it but it should work. – Ankur Mar 28 '19 at 01:19
0

Introducing pathlib.Path, an object-oriented interface to paths that will simplify your life.

Keeping your logic, you can use glob directly on Path objects, then combine with pandas.concat() to concatenate dataframes as you load your csv files.

import pandas
from pathlib import Path

srcdir = Path(r'/Users/Desktop/Practice code')              # Get current working directory

df = pd.concat(pd.read_csv(file, index_col=None, header=0)  # Concatenate dataframes from generator
               for file in srcdir.glob('summary_*.csv'),    # Using pathlib.Path's glob
               axis=0, ignore_index=True)                   # Keeping your concat settings

Regarding the row names, you should integrate in your question an extract of your summary_*.csv files and the desired row name.

FabienP
  • 3,018
  • 1
  • 20
  • 25
  • SyntaxError: Generator expression must be parenthesized – Shawn Mar 15 '20 at 19:04
  • Thanks for pointing this error. I think it is new, either changed from new version of Python or Pandas (but I didn't located the change yet). You should be able to fix this by adding parenthesis around the first argument. Please do not hesitate to submit an edit to my answer this fix is working. – FabienP Mar 15 '20 at 19:29