1

I used this solution to read in all CSV files from a Google Drive into a data frame in a Colab notebook using this solution (reading csv file with specific name in Python). Each file has the same naming convention and I want to split the file name into two new columns and append those to the dataframe.

The file name are structured like this: Platform_Company.csv (example Instagram_Microsoft.csv) and I want the columns to be appended at the beginning of the dataframe.

platform company employee id employee email
Instagram Microsoft person 1 humanperson@microsoft.com

So far, I've used this to read in the files. I'm not sure what the layer number is or whether I need it.

from pathlib import Path
import pandas as pd

ls_data = []

csv_directory = '/content/drive/MyDrive/Colab Notebooks/'

for idx, filename in enumerate(Path(csv_directory).glob('*Instagram_*.csv')):
    df_temp = pd.read_csv(filename)
    df_temp.insert(0, 'layer_number', idx)
    ls_data.append(df_temp) 

df = pd.concat(ls_data, axis=0)

I tried incorporating the following script (Read multiple csv files and Add filename as new column in pandas), but it isn't working and I'm not sure how to add it into the current version.

import glob
import os
import pandas as pd

path = r'\OUTPUT'
all_files = glob.glob(os.path.join(path, "*.csv"))     

df_from_each_file = (pd.read_csv(f, delimiter='|') for f in all_files)
concatenated_df   = pd.concat(df_from_each_file, ignore_index=True)
concatenated_df['filename'] =(all_files[f] for f in all_files)

Thanks for any guidance and/or suggestions!

1 Answers1

0

You can use (Platform, Company) as the key of a dict then use pd.concat to get the expected output:

import pandas as pd
import pathlib

csv_directory = pathlib.Path('/content/drive/MyDrive/Colab Notebooks/')

data = {}
for filename in csv_directory.glob('*Instagram_*.csv'):
    df = pd.read_csv(filename)
    platform, company = filename.stem.split('_')
    data[platform, company] = df

df = (pd.concat(data, axis=0).droplevel(-1)
        .rename_axis(['platform', 'company']).reset_index())

Output:

>>> df
    platform    company employee id             employee email
0  Instagram  Microsoft    person 1  humanperson@microsoft.com
1  Instagram   Facebook    person 2   humanperson@facebook.com
2  Instagram   Facebook    person 3   humanperson@facebook.com

Input files

# Instagram_Microsoft.csv
employee id,employee email
person 1,humanperson@microsoft.com

# Instagram_Facebook.csv
employee id,employee email
person 2,humanperson@facebook.com
person 3,humanperson@facebook.com
Corralien
  • 109,409
  • 8
  • 28
  • 52
  • I want to try just grabbing the text before the underscore and tried removing company, but keep getting errors. It doesn't seem to find all of the files. How can I return just the part of the filename before the underscore to the list? I tried this - s.str.rsplit("/", n=1, expand=True) - but got this error AttributeError: 'PosixPath' object has no attribute 'split' –  Mar 18 '23 at 02:53
  • I also tried modifying it like this: https://pastebin.com/QeLGiyV3 and got this error: 'PosixPath' object has no attribute 'split' –  Mar 18 '23 at 03:17