0

Let us say I have the following columns in a data frame:

title
year
actor1
actor2
cast_count
actor1_fb_likes
actor2_fb_likes
movie_fb_likes

I want to select the following columns from the data frame and ignore the rest of the columns :

  1. the first 2 columns (title and year)
  2. some columns based on name - cast_count
  3. some columns which contain the string "actor1" - actor1 and actor1_fb_likes

I am new to pandas. For each of the above operations, I know what method to use. But I want to do all three operations together as all I want is a dataframe that contains the above columns that I need for further analysis. How do I do this?

Here is example code that I have written:

data = {
"title":['Hamlet','Avatar','Spectre'],
"year":['1979','1985','2007'],
"actor1":['Christoph Waltz','Tom Hardy','Doug Walker'],
"actor2":['Rob Walker','Christian Bale ','Tom Hardy'],
"cast_count":['15','24','37'],
"actor1_fb_likes":[545,782,100],
"actor2_fb_likes":[50,78,35],
"movie_fb_likes":[1200,750,475],
}
df_input = pd.DataFrame(data)
print(df_input)

df1 = df_input.iloc[:,0:2] # Select first 2 columns
df2 = df_input[['cast_count']] #select some columns by name - cast_count
df3 = df_input.filter(like='actor1') #select columns which contain the string "actor1" - actor1 and actor1_fb_likes

df_output = pd.concat(df1,df2, df3) #This throws an error that i can't understand the reason
print(df_output)
Siraj Samsudeen
  • 1,624
  • 7
  • 26
  • 35
  • Does this answer your question? [Selecting multiple columns in a pandas dataframe](https://stackoverflow.com/questions/11285613/selecting-multiple-columns-in-a-pandas-dataframe) – Henry Yik Jun 17 '20 at 01:59
  • @HenryYik, thanks for the link but it does NOT answer my question which is to combine multiple criteria and I already went through the docs for both filter, loc and iloc. – Siraj Samsudeen Jun 17 '20 at 02:06
  • kindly share sample data with expected output – sammywemmy Jun 17 '20 at 02:39
  • @sammywemmy, I have added sample code and some clarifications to make my intent clear. – Siraj Samsudeen Jun 19 '20 at 01:24
  • ```pd.concat([df_input.iloc[:,:2], df_input.filter(regex = "actor1|cast")], axis = 1)``` does this work? – sammywemmy Jun 19 '20 at 01:52
  • @sammywemmy yes this would work and thanks - your suggestions eliminates the local variables. But i was not sure whether this is the right way. Would this produce unnecessary copy of the df_input dataframe? – Siraj Samsudeen Jun 20 '20 at 03:34
  • 1
    well, you would be accessing the dataframe twice, which doesnt look excessive. Another way about it, is to read in the data once, get the cols u are interested in : ```cols = df_input.columns[:2].append( df_input.columns.str.extract(r"(actor1.*|cast.*)", expand=False).dropna() )``` , then keep only the interested columns in the dataframe : ```df_input.filter(cols)```, this way you read in the data only once, avoid ```concat```, and overall reduce computation time. – sammywemmy Jun 20 '20 at 03:41

1 Answers1

0

Question 1:

df_1 = df[['title', 'year']]

Question 2:

# This is an example but you can put whatever criteria you'd like
df_2 = df[df['cast_count'] > 10]

Question 3:

# This is an example but you can put whatever criteria you'd like this way
df_2 = df[(df['actor1_fb_likes'] > 1000) & (df['actor1'] == 'actor1')]

Make sure each filter is contained within it's own set of parenthesis () before using the & or | operators. & acts as an and operator. | acts as an or operator.

Rookie
  • 305
  • 2
  • 11
  • Thank you for your time and effort to help me. Sorry - this does not answer my question. I have edited my question to make my intent more clear. – Siraj Samsudeen Jun 19 '20 at 01:24