Loop through CSV files to pull out specific columns

Question

I have multiple CSV files in the same directory that contain different survey response data (different questions, different order of questions). What I'm looking to achieve is a loop through all of the CSVs to find specific column headings and store the results in a pandas dataframe.

What I have so far:

import pandas as pd
import csv
import os
import glob

path = "file/path/"
all_files = glob.glob(os.path.join(path, "*.csv")) #make list of paths

for file in all_files:
    file_name = os.path.splitext(os.path.basename(file))
    dfn = pd.read_csv(file, encoding='latin1')
    dfn.index.name = file_name

So the code currently reads in all of the CSVs from the directory, now I think I need another loop to go through them to find the data within a column. The column in question I'm looking for contains the text "would recommend" (there's a possibility not all column names will be worded the same so would need to be contains). I am quite new to Python and really struggling, any help is greatly appreciated.

Example of CSV1:

Programme,"Overall, I am satisfied with the quality of the programme",I would recommend the company to a friend or colleague,Please comment on any positive aspects of your experience of this programme
Nursing,4,4,[IMAGE]
Nursing,1,3,very good
Nursing,4,5,I enjoyed studying tis programme

Example of CSV2:

Programme,I would recommend the company to a friend,The programme was well organised and running smoothly,It is clear how students' feedback on the programme has been acted on
IT,4,2,4
IT,5,5,5
IT,5,4,5

Check this answer -> https://stackoverflow.com/a/11531402/8150371 — Stack, Sep 30 '20 at 19:08
Can you give an example of two CSVs and the data you want the final dataframe to hold? — rcriii, Sep 30 '20 at 19:15
@rcriii I've added a simplified version of the two CSVs in the question. I want the dataframe to hold the programme and the 'would recommend' question responses — willd9, Sep 30 '20 at 19:30
Would be much nicer if they were text I could copy and paste. — rcriii, Sep 30 '20 at 19:49
Apologies, I'm not sure how to add the actual CSVs. I thought it would be to just give you an idea of the data! — willd9, Sep 30 '20 at 19:55
Open them in a text file then paste into your question just like code. — rcriii, Sep 30 '20 at 19:59

score 1 · Answer 1 · answered Sep 30 '20 at 20:34

I'd change the name of the columns to a common value, then concat them together, uing the join parameter to specify that you only want the common columns.

import pandas as pd
from io import StringIO

csv1 = StringIO("""Programme,"Overall, I am satisfied with the quality of the programme",I would recommend the company to a friend or colleague,Please comment on any positive aspects of your experience of this programme
Nursing,4,4,IMAGE
Nursing,1,3,very good
Nursing,4,5,I enjoyed studying tis programme""")

csv2 = StringIO("""Programme,I would recommend the company to a friend,The programme was well organised and running smoothly,It is clear how students' feedback on the programme has been acted on
IT,4,2,4
IT,5,5,5
IT,5,4,5""")

dfout = pd.DataFrame(columns=['Programme', 'Recommends'])

for file in [csv1, csv2]:
    dfn = pd.read_csv(file)
    matching = [s for s in dfn.columns if "would recommend" in s]
    if matching:
        dfn = dfn.rename(columns={matching[0]:'Recommends'})
        dfout = pd.concat([dfout, dfn], join="inner")

print(dfout)

Programme Recommends
0   Nursing          4
1   Nursing          3
2   Nursing          5
0        IT          4
1        IT          5
2        IT          5

This works for the example you've given, but when I attempt to read in a CSV and run the loop through it, I get an error message: ValueError(msg.format(_type=type(filepath_or_buffer))) ValueError: Invalid file path or buffer object type: — willd9, Oct 01 '20 at 08:29
Here is the code I attempted: df = pd.read_csv('data.csv') dfout = pd.DataFrame(columns=['Subunit', 'Recommends']) for file in [df]: dfn = pd.read_csv(file) matching = [s for s in dfn.columns if "would recommend" in s] if matching: dfn = dfn.rename(columns={matching[0]:'Recommends'}) dfout = pd.concat([dfout, dfn], join="inner") print(dfout) — willd9, Oct 01 '20 at 08:30

score 0 · Answer 2 · answered Sep 30 '20 at 19:24

0

You do not need to loop:

matching = [s for s in dfn.columns if "would recommend" in s]

in 'matching' you will find the names of the columns which match your criteria

answered Sep 30 '20 at 19:24

gtomer

5,643
1
10
21

Loop through CSV files to pull out specific columns

2 Answers2