When I run this code in Jupyter Notebook:

columns = ['nkill', 'nkillus', 'nkillter','nwound', 'nwoundus', 'nwoundte', 'propvalue', 'nperps', 'nperpcap', 'iyear', 'imonth', 'iday']

for col in columns:
    # needed for any missing values set to '-99'
    df[col] = [np.nan if (x < 0) else x for x in df[col].tolist()]
    # calculate the mean of the column
    column_temp = [0 if math.isnan(x) else x for x in df[col].tolist()]
    mean = round(np.mean(column_temp))
    # then apply the mean to all NaNs
    df[col].fillna(mean, inplace=True)

I receive the following error:

AttributeError                            Traceback (most recent call last)
<ipython-input-56-f8a0a0f314e6> in <module>()
  3 for col in columns:
  4     # needed for any missing values set to '-99'
----> 5     df[col] = [np.nan if (x < 0) else x for x in df[col].tolist()]
  6     # calculate the mean of the column
  7     column_temp = [0 if math.isnan(x) else x for x in df[col].tolist()]

/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py in __getattr__(self, name)
   4374             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   4375                 return self[name]
-> 4376             return object.__getattribute__(self, name)
   4377 
   4378     def __setattr__(self, name, value):

AttributeError: 'DataFrame' object has no attribute 'tolist'

The code works fine when I run it in PyCharm, and everything I've read suggests it should work in Jupyter as well. Am I missing something?

I've created a Minimal, Complete, and Verifiable example below:

import numpy as np
import pandas as pd
import os
import math

# get the path to the current working directory
cwd = os.getcwd()

# then add the name of the Excel file, including its extension to get its relative path
# Note: make sure the Excel file is stored inside the cwd
file_path = cwd + "/data.xlsx"

# Read the Excel file into a DataFrame
df = pd.read_excel(file_path)

columns = ['nkill', 'nkillus', 'nkillter', 'nwound', 'nwoundus', 'nwoundte', 'propvalue', 'nperps', 'nperpcap', 'iyear', 'imonth', 'iday']

for col in columns:
    # needed for any missing values set to '-99'
    df[col] = [np.nan if (x < 0) else x for x in df[col].tolist()]
    # calculate the mean of the column
    column_temp = [0 if math.isnan(x) else x for x in df[col].tolist()]
    mean = round(np.mean(column_temp))
    # then apply the mean to all NaNs
    df[col].fillna(mean, inplace=True)
Uncle_Timothy
  • Does this work? `df[col].tolist()` => `df[col].values.tolist()` – JacobIRR Dec 03 '18 at 18:27
  • No sorry, it throws up a different error: TypeError: '<' not supported between instances of 'list' and 'int' – Uncle_Timothy Dec 03 '18 at 18:29
  • Undoing the dupe-close because `df[col]` should be a Series, not a DataFrame. The correct fix should involve figuring out why it's not a Series, not using the code for DataFrames. – user2357112 Dec 03 '18 at 18:30
  • That said, we're going to need to see more than what you've shown us to reproduce this error. Perhaps `df` isn't what you think it is. – user2357112 Dec 03 '18 at 18:33
  • Often attribute errors like this are the result of the object being the wrong type. The first comment was based on the idea that numpy arrays have a `.tolist` method, but `DataFrame` does not. `Series` has this method (which uses the numpy `tolist` on its values). – hpaulj Dec 03 '18 at 18:36
  • @user2357112 What can I show you? I parsed in an Excel file and stored it in df: file_path = cwd + "/globalterrorismdb_0718dist.xlsx" # Copy the database to file df = pd.read_excel(file_path) – Uncle_Timothy Dec 03 '18 at 18:38
  • @Uncle_Timothy, see [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). Mock up a **[mcve]**, i.e. a **minimal** and **reproducible** example of the problem you observe. – jpp Dec 03 '18 at 18:39
  • I would start with a `print(type(df))` and `print(type(df['nkill']))`, to verify whether the objects are a DataFrame and a Series. – hpaulj Dec 03 '18 at 18:41
  • @jpp I've included a Minimal, Complete, and Verifiable example in my post – Uncle_Timothy Dec 03 '18 at 18:49
  • Without the `xlsx` file we can't copy-n-paste and run your code. It isn't Verifiable. – hpaulj Dec 03 '18 at 19:41
  • If you're trying to replace values in a dataframe, there may be better approaches than a list comprehension... what happens if you get rid of the `.tolist()`? – Evan Dec 03 '18 at 20:07
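
The type check suggested in the comments can be sketched as below. The thread never confirms why `df[col]` is a DataFrame here, so the duplicated column label in this example is only an assumption: it is one common way for `df[col]` to return a DataFrame instead of a Series. The frame and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with a duplicated column label ('nkill' appears twice)
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
                  columns=['nkill', 'nkill', 'nwound'])

# A unique label selects a Series; a duplicated label selects a DataFrame,
# which has no .tolist() method
print(type(df['nwound']))
print(type(df['nkill']))

# List any column labels that occur more than once
dupes = df.columns[df.columns.duplicated()].tolist()
print(dupes)
```

If `dupes` is empty on the real data, the cause lies elsewhere, for example a `df` left over from an earlier notebook cell.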

1 Answer

You have an XY Problem. You've described what you are trying to achieve in your code comments, but your approach is not appropriate for Pandas.

Avoid for loops and lists

With Pandas, you should avoid explicit for loops and conversion to Python lists. Pandas builds on NumPy arrays, which support vectorised column-wise operations.

So let's look at how you can rewrite:

for col in columns:
    # values less than 0 set to NaN
    # calculate the mean of the column with 0 for NaN
    # then apply the mean to all NaNs

You can now use Pandas methods to achieve the above.
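
As a minimal sketch, the three commented steps above map onto one vectorised call each for a single column. The Series and its values are hypothetical, with -99 standing in for a missing value as in the question.

```python
import pandas as pd

# Hypothetical column; -99 marks a missing value, as in the question
s = pd.Series([2, -99, 4, -99], name='nkill')

s = s.mask(s < 0)            # values less than 0 set to NaN
mean = s.fillna(0).mean()    # mean of the column with 0 for NaN
s = s.fillna(mean)           # then apply the mean to all NaNs

print(s.tolist())
```

No Python-level loop or list conversion is involved; each step operates on the whole column at once.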

apply + pd.to_numeric + mask + fillna

You can define a function mean_update and use pd.DataFrame.apply to apply it to each series:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, -2, 3, np.nan],
                   'B': ['hello', 4, 5, np.nan],
                   'C': [-1.5, 3, np.nan, np.nan]})

def mean_update(s):
    s_num = pd.to_numeric(s, errors='coerce')  # convert to numeric
    s_num = s_num.mask(s_num < 0)              # replace values less than 0 with NaN
    s_mean = s_num.fillna(0).mean()            # calculate mean
    return s_num.fillna(s_mean)                # replace NaN with mean

df = df.apply(mean_update)                     # apply to each series

print(df)

     A     B     C
0  1.0  2.25  0.75
1  1.0  4.00  3.00
2  3.0  5.00  0.75
3  1.0  2.25  0.75
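
To limit the update to the numeric columns listed in the question rather than the whole frame, one option is to apply the function to a column subset and assign the result back. This is a sketch under assumptions: the small frame below is a made-up stand-in for the Excel data, and the `city` column is hypothetical, included only to show that unlisted columns are left alone.

```python
import pandas as pd

def mean_update(s):
    s_num = pd.to_numeric(s, errors='coerce')  # convert to numeric
    s_num = s_num.mask(s_num < 0)              # replace values less than 0 with NaN
    s_mean = s_num.fillna(0).mean()            # mean with NaN counted as 0
    return s_num.fillna(s_mean)                # replace NaN with that mean

# Hypothetical stand-in for the Excel data
df = pd.DataFrame({'nkill': [4, -99, 2, 6],
                   'nwound': [10, 0, -99, -99],
                   'city': ['a', 'b', 'c', 'd']})

columns = ['nkill', 'nwound']

# Apply only to the selected columns; 'city' is untouched
df[columns] = df[columns].apply(mean_update)

print(df)
```

The original loop also rounded the mean before filling; if that matters, `s_mean` can be wrapped in `round()` inside `mean_update`.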
jpp