
I imported a CSV file and it is currently in a DataFrame. It has about 28 columns in total and I only want to keep 9 of them. This is what my code looks like:

import os, glob
import pandas as pd

#set the directory
os.chdir(r'C:\Documents\test') 
#set the type of file
extension = 'csv' 
#take all files with the csv extension into an array
all_filenames = [i for i in glob.glob('*.{}'.format(extension))] 

col_to_keep=[3,5] #is this how I would put the column position?
#combine all files in the list
df = [pd.read_csv(f, delimiter=';', error_bad_lines=False) for f in all_filenames]

print(df[col_to_keep])

Previously I had my variable col_to_keep set to the column names:

col_to_keep = ['Name', 'ID', 'Area', 'Length']

However, I get an error that reads:

line 21, in <module>
    print(df[col_to_keep])

TypeError: list indices must be integers or slices, not list

I'm not sure what I'm doing wrong, because I tried using the names of the columns and also what I think are the positions of the columns. The only other reason I can think of is that some of the values in the columns are floats (e.g. 123.98).

I plan on taking this information and bringing it into Excel, where eventually I will create a loop that goes through all CSV files in a specific folder.

How would I be able to only keep the columns that I want?

Thank you in advance.

rcheeks23
  • You've made a list of DataFrames with a comprehension `[pd.read_csv(f, delimiter=';', error_bad_lines=False) for f in all_filenames]` and called it `df`. It is a _list_ of DataFrames. You'll need to access a specific DataFrame with a standard Python list index `df[0]`, _then_ you can use DataFrame operations `df[0][['Name', 'ID', 'Area', 'Length']]`, or you could iterate over the DataFrames in a loop. You might consider renaming `df` to `dfs` or `list_of_df` to avoid future confusion. – Henry Ecker Oct 05 '21 at 23:54
  • Beyond this [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) has a `usecols` parameter which could be used in the comprehension instead of subsetting afterwards. `dfs = [pd.read_csv(f, delimiter=';', error_bad_lines=False, usecols=col_to_keep) for f in all_filenames]` if you wanted to read in _only_ particular columns. – Henry Ecker Oct 05 '21 at 23:56
  • Wow, thank you so much. So if I understand correctly, now that I have that list of DataFrames, I would be able to rearrange the columns by doing `new_col = pd.DataFrame(df[0]['ID', 'Name', 'Length', 'Area'])`, because `df[0]` selects the first DataFrame in the list? – rcheeks23 Oct 06 '21 at 00:21
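
To make the comments above concrete, here is a minimal sketch of selecting and reordering columns when `read_csv` has produced a list of DataFrames. The in-memory CSV strings and the `Extra` column are made up for illustration and stand in for the real files; only the column names `Name`, `ID`, `Area`, `Length` come from the question. Note the double brackets: the inner list holds the column names, and its order becomes the column order of the result, so `df[0]['ID', 'Name', 'Length', 'Area']` with single brackets would not work.

```python
import io
import pandas as pd

# Hypothetical stand-ins for the semicolon-delimited CSV files on disk.
csv_a = "ID;Name;Area;Length;Extra\n1;a;10.5;123.98;x\n2;b;20.0;7.25;y\n"
csv_b = "ID;Name;Area;Length;Extra\n3;c;30.0;1.5;z\n"

# One DataFrame per file; the name dfs makes it clear this is a list.
dfs = [pd.read_csv(io.StringIO(text), delimiter=';') for text in (csv_a, csv_b)]

col_to_keep = ['Name', 'ID', 'Area', 'Length']

# Index the list first, then select columns with a list of names.
first = dfs[0][col_to_keep]

# Or apply the same selection to every DataFrame in the list.
trimmed = [d[col_to_keep] for d in dfs]

# Optionally stack all files into a single DataFrame.
combined = pd.concat(trimmed, ignore_index=True)
print(combined)
```

If a single table is the goal, `combined` could then be written out with something like `combined.to_excel(...)`, assuming an Excel writer engine such as openpyxl is installed.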

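The `usecols` suggestion can be sketched the same way. Again the CSV text is invented for illustration. `usecols` accepts either column names or zero-based positions, which covers both versions of `col_to_keep` from the question. One thing to be aware of is that `usecols` only filters: the result keeps the file's column order, not the order of the list you pass. The sketch leaves out `error_bad_lines`, which newer pandas releases replace with `on_bad_lines`.

```python
import io
import pandas as pd

# Hypothetical semicolon-delimited CSV content standing in for one file.
csv_text = "ID;Name;Area;Length;Extra\n1;a;10.5;123.98;x\n2;b;20.0;7.25;y\n"

# Keep columns by name; unlisted columns (here Extra) are never loaded.
by_name = pd.read_csv(io.StringIO(csv_text), delimiter=';',
                      usecols=['Name', 'ID', 'Area', 'Length'])

# Keep columns by zero-based position instead.
by_pos = pd.read_csv(io.StringIO(csv_text), delimiter=';', usecols=[0, 2])

# Columns come back in file order, so reorder afterwards if needed.
reordered = by_name[['Name', 'ID', 'Area', 'Length']]

print(by_name.columns.tolist())
print(by_pos.columns.tolist())
print(reordered.columns.tolist())
```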