UPDATED

Problem 1: I have a data set in which many values are NaN. Running main.loc[main.isna().sum(axis=1) >= 2] outputs:

  ID:  GNDR  COUNTRY    ...         BIKE      CAR        PBLC        
    1     0     NaN     ...          NaN      NaN         NaN          
    1     0     NaN     ...          NaN      NaN         NaN
    16    1     UK      ...          123       0         10232

Surely rows 0 and 1 should be dropped?

Problem 2: As an example, if ID is greater than 1, as in the last row above, the person has entered data that many times (16 here). I want to average these entries so that people who entered data only once do not show up as outliers to my perceptron later on. My idea was to iteratively average all rows with ID greater than 1 while loading the data into my DataFrame.

SAMPLE CODE:
df_2 = pandas.read_csv('logs.csv', names=colnames_df_2, skiprows=[0])
df_2['ID'] = df_2['ID'].apply(str)

main = df_1.merge(df_2, how='left', on='msno')
main.loc[main.isna().sum(axis=1) >= 2]
print(main)

2 Answers

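For Problem 1:

The thresh parameter of dropna means:

Require that many non-NA values.

So if both rows are still returned, each of them still contains at least that many non-null values: ID and GNDR are filled in, even though every other column is NaN.

I tried a plain dropna() on your df below, and it drops both rows.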

In [527]: df
Out[527]: 
   ID  GNDR  COUNTRY  BIKE  CAR  PBLC
0   1     0      NaN   NaN  NaN   NaN
1   1     0      NaN   NaN  NaN   NaN

In [528]: df.dropna()
Out[528]: 
Empty DataFrame
Columns: [ID, GNDR, COUNTRY, BIKE, CAR, PBLC]
Index: []
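
To make the thresh semantics concrete, here is a minimal sketch that rebuilds the two-row frame above and compares a few thresholds. The variable names (`kept_thresh_2`, `max_nans`, and so on) are made up for illustration, and the frame only contains the columns visible in the printed output.

```python
import numpy as np
import pandas as pd

# Rebuild the two-row frame shown above: only ID and GNDR are filled in.
df = pd.DataFrame({
    "ID": [1, 1],
    "GNDR": [0, 0],
    "COUNTRY": [np.nan, np.nan],
    "BIKE": [np.nan, np.nan],
    "CAR": [np.nan, np.nan],
    "PBLC": [np.nan, np.nan],
})

# Number of non-NA values in each row (2 for both rows here).
non_na_per_row = df.notna().sum(axis=1)

# thresh=2: two non-NA values are enough, so both rows are kept.
kept_thresh_2 = df.dropna(thresh=2)

# thresh=3: three non-NA values are required, so both rows are dropped.
kept_thresh_3 = df.dropna(thresh=3)

# To drop rows with two or more NaNs, allow at most one NaN per row,
# i.e. require at least (number of columns - 1) non-NA values.
max_nans = 1  # illustrative name, not a pandas parameter
kept_custom = df.dropna(thresh=df.shape[1] - max_nans)

print(non_na_per_row.tolist(), len(kept_thresh_2), len(kept_thresh_3), len(kept_custom))
```

In other words, thresh counts the values that are present, not the ones that are missing, so the threshold has to be expressed relative to the number of columns.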
Mayank Porwal

For Problem 1, here is an example dataset:

>>> df
     A    B    C
0  foo    2    3
1  foo  NaN  NaN
2  foo    1    4
3  bar  NaN  NaN
4  foo  NaN  NaN

df.dropna(thresh=2) goes through all the rows and keeps each row that has at least 2 non-NA values. Rows 0 and 2 have three non-NA values, so they are kept; rows 1, 3 and 4 have only one, so they are dropped.

>>> df.dropna(thresh=2)
     A  B  C
0  foo  2  3
2  foo  1  4

Rows where the NaN count is at least 2:

>>> df.loc[df.isna().sum(axis=1) >= 2]
     A    B    C
1  foo  NaN  NaN
3  bar  NaN  NaN
4  foo  NaN  NaN
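
Note that indexing with this mask selects the rows that have two or more NaNs; it does not remove them, and the result is lost unless it is assigned. A minimal sketch of dropping those rows instead, assuming a frame shaped like the one in the question (the values are copied from its printed output, and the columns hidden behind "..." are left out):

```python
import numpy as np
import pandas as pd

# Frame shaped like the question's output; hidden columns are omitted.
main = pd.DataFrame({
    "ID": [1, 1, 16],
    "GNDR": [0, 0, 1],
    "COUNTRY": [np.nan, np.nan, "UK"],
    "BIKE": [np.nan, np.nan, 123],
    "CAR": [np.nan, np.nan, 0],
    "PBLC": [np.nan, np.nan, 10232],
})

# True for every row that contains two or more NaNs.
too_many_nans = main.isna().sum(axis=1) >= 2

# Selecting with the mask returns the sparse rows themselves.
sparse_rows = main.loc[too_many_nans]

# Inverting the mask and assigning the result is what actually drops them.
main = main.loc[~too_many_nans]

print(main)
```

Printing `main` without the assignment, as in the question's sample code, would still show every row.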

To get the mean(), you can try something like:

>>> df.B.ge(str(2))
0     True
1    False
2    False
3    False
4    False
Name: B, dtype: bool
>>>
>>>
>>> df[df.B.ge(str(2))]
     A  B  C
0  foo  2  3
>>> df[df.B.ge(str(2))]['C'].mean()
     3.0
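
For Problem 2, another option is to average after loading rather than during it, using groupby. This is only a sketch under assumptions: it supposes the log has one row per entry and that msno (the merge key in the question) identifies a person. The values below are made up for illustration, and only the BIKE and CAR columns from the question's output are used.

```python
import pandas as pd

# Hypothetical log data: user "a" entered data twice, user "b" once.
logs = pd.DataFrame({
    "msno": ["a", "a", "b"],
    "BIKE": [100.0, 200.0, 50.0],
    "CAR": [0.0, 2.0, 1.0],
})

# One row per user, each numeric column replaced by that user's mean.
per_user = logs.groupby("msno", as_index=False)[["BIKE", "CAR"]].mean()

print(per_user)
```

The averaged frame has one row per person, so it can then be merged with the other table on msno in the same way as before.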
Karn Kumar
  • Does not seem to work for me. Using: `main.loc[main.isna().sum(axis=1) >= 2]` still gives me: main = `6769469 1 0 NaN 7 20151020 NaN NaN NaN NaN NaN NaN NaN NaN 6769470 15 26 female 4 20151020 NaN NaN NaN NaN NaN NaN NaN NaN 6769471 1 0 NaN 4 20151020 NaN NaN NaN` –  Nov 26 '18 at 11:18
  • Which version of pandas are you running? – Karn Kumar Nov 26 '18 at 11:20
  • @Moe, you need to provide a few lines of the actual DataFrame so that we can get a real sense of the data. – Karn Kumar Nov 26 '18 at 11:22