I'm a beginner in coding and need help with this homework problem: Consider all columns whose name starts with "Cl" (Classification, Clustering, and assume that there could be many others). Retrieve the rows of those people with the same value in all of their "Cl" columns. For example, you should return a person with 4.0 in all of the Cl columns, or a person with 3.0 in all of the Cl columns; but you should NOT return a person with 4.0 in all Cl columns except for one column where there is a 3.0. Hint: Start by computing the maximum and minimum value across the "Cl" columns for each student.

I'm not sure where to start with this problem, and I can't quite understand what is being asked.

Picture of sample data set: https://i.stack.imgur.com/xglFm.png

The code given for loading the dataframe:

 import pandas as pd
 df = pd.read_csv("cleaned_survey.csv", index_col=0)
 df.drop(['ProgSkills','Languages','Expert'],axis=1,inplace=True)  
Has QUIT--Anony-Mousse
none
  • You should post some dummy data. It is helpful to understand the problem and provide a quick reply. Since you're working with a Pandas `DataFrame`, it can just be some simple `DataFrame` with a few rows of dummy values you made up and in the format of the original data you're working with. This would be more helpful than posting an image. – edesz Feb 05 '19 at 03:57
  • @edesz I have a picture of the sample data set posted above. Let me know if helpful, thank you – none Feb 05 '19 at 21:54

1 Answer

Generate some dummy data per the OP requirements

import pandas as pd

a = [['Classification','Clustering','Top'],
        [8,7,5],
        [8,1333,3],
        [50,50,1],
        [50,3363,2],
        [50,50,3],
        [83,50,4],
        [83,83,5]]
df = pd.DataFrame(a[1:], columns=a[0])
print(df)

   Classification  Clustering  Top
0               8           7    5
1               8        1333    3
2              50          50    1
3              50        3363    2
4              50          50    3
5              83          50    4
6              83          83    5

Select the columns whose names start with "Cl" (this returns 2 columns):

df = df[df.columns[df.columns.str.startswith('Cl')]]
print(df)

   Classification  Clustering
0               8           7
1               8        1333
2              50          50
3              50        3363
4              50          50
5              83          50
6              83          83

Finally, use the pandas .nunique(axis=1) method to count the distinct values in each row, i.e. across the columns (in the dummy data, if both columns contain the same value, this returns 1). Then compare that count to the integer 1 with .eq(1). Where the two are equal, the row has exactly one unique entry, so the boolean mask is True for it; in other words, that row contains the same value in all columns. Indexing with the mask, df[...], returns only the True rows, which is what the question asks for.

print(df[df.nunique(axis=1).eq(1)])

   Classification  Clustering
2              50          50
4              50          50
6              83          83
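The steps above overwrite df with only the "Cl" columns, so the Top column is lost. If the whole row of each matching person is wanted, one possible sketch is to build the mask from the "Cl" columns and apply it to the untouched frame. The names cl_cols, same_value and result are illustrative, and the data is the dummy data from this answer, not the real survey:

```python
import pandas as pd

# Dummy data from the answer (not the real survey file)
df = pd.DataFrame(
    {
        "Classification": [8, 8, 50, 50, 50, 83, 83],
        "Clustering": [7, 1333, 50, 3363, 50, 50, 83],
        "Top": [5, 3, 1, 2, 3, 4, 5],
    }
)

# Labels of every column whose name starts with "Cl"
cl_cols = df.columns[df.columns.str.startswith("Cl")]

# True where a row holds exactly one distinct value across the "Cl" columns
same_value = df[cl_cols].nunique(axis=1).eq(1)

# Apply the mask to the original frame so all columns are kept
result = df[same_value]
print(result)
```

Note that nunique ignores missing values by default (dropna=True), so a row with one value plus NaN in the "Cl" columns would also count as 1; whether that is acceptable depends on the assignment.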

To use min and max instead: if the minimum and maximum values across the columns of a row are equal, then all elements of that row are identical, as required. Here apply with axis=1 calls the function once per row:

print(df[df.apply(lambda x: min(x)==max(x), 1)])

   Classification  Clustering
2              50          50
4              50          50
6              83          83
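The same min/max check can also be written without apply, which follows the hint in the question more literally: compute the row-wise minimum and maximum of the "Cl" columns and compare them. This is a sketch on the dummy data; using filter with the regex ^Cl is one assumed way to pick the columns, and the variable names are illustrative:

```python
import pandas as pd

# Dummy data from the answer (not the real survey file)
df = pd.DataFrame(
    {
        "Classification": [8, 8, 50, 50, 50, 83, 83],
        "Clustering": [7, 1333, 50, 3363, 50, 50, 83],
        "Top": [5, 3, 1, 2, 3, 4, 5],
    }
)

# Keep only columns whose name starts with "Cl"
cl = df.filter(regex=r"^Cl")

# Row-wise minimum and maximum across the "Cl" columns
row_min = cl.min(axis=1)
row_max = cl.max(axis=1)

# Rows where min equals max have the same value in every "Cl" column
result = df[row_min == row_max]
print(result)
```

On large frames this tends to be faster than apply with a lambda, because min and max run over whole columns at once rather than row by row in Python.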
edesz
  • Instead of `contains` you should use `startswith` because OP mentioned `Consider all columns whose name starts with "Cl"`. – Mohamed Thasin ah Feb 05 '19 at 05:40
  • @edesz when I tried running your first block of code, `df=pd.read_clipboard()`, `df.columns=['Classification','Clustering','Top']`, `print(df)`, I got this error message: Length mismatch: Expected axis has 7 elements, new values have 3 elements – none Feb 05 '19 at 21:58
  • @edesz and when I tried the second line of code I got this error message: Can only use .str accessor with string values (i.e. inferred_type is 'string', 'unicode' or 'mixed') – none Feb 05 '19 at 22:03
  • @MohamedThasinah any suggestions? – none Feb 05 '19 at 22:03
  • @none see the updated answer to generate data. It could be that your second problem, `inferred_type is 'string', 'unicode' or 'mixed'`, is related to the first (i.e. how pandas was reading/loading the data into a `DataFrame`). – edesz Feb 05 '19 at 23:51