
I am trying to load a bunch of CSVs into a database and would like to get rid of any rows in these tables that contain the value "-". I'm trying to do the same thing as in the following link, but using an iterable instead of a predetermined column, since I don't know in advance which tables and columns will have these values:

Deleting DataFrame row in Pandas based on column value

My code:

dfs = {}

for doc in fList:
    i = "{}\\{}".format(path, doc)

    df = pd.read_csv(i)

    for col in df.columns:
        df = df[df.col != "-"]

This returns the following error:

AttributeError                            Traceback (most recent call last)
<ipython-input-291-43edac7a4ed7> in <module>()
      8     #print dfs
      9     for col in df:
---> 10         df = df[df.col != "-"]

C:\ProgramData\Anaconda2\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
   2968             if name in self._info_axis:
   2969                 return self[name]
-> 2970             return object.__getattribute__(self, name)
   2971 
   2972     def __setattr__(self, name, value):

AttributeError: 'DataFrame' object has no attribute 'col'

It seems that I cannot use the iterable in the loop. It would defeat the purpose of writing the script if I had to open each file and change the values by hand. Is there any way to loop through the tables and delete rows with the bad values?

geoJshaun

1 Answer


You cannot dynamically access a column of df through a variable with attribute syntax; that raises an AttributeError, because the `.` looks for an attribute of df literally named col, not an attribute named by the value stored in col. There's a difference.

If you wanted to, you'd need the `__getitem__` accessor: `df[col]` (a minimal fix along those lines is sketched below). However, you should avoid loopy solutions where you can. Here are a couple of alternatives.
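For completeness, here is your loop with bracket indexing instead of attribute access (a sketch, assuming fList and path are defined as in your snippet):

import pandas as pd

dfs = {}
for doc in fList:
    df = pd.read_csv("{}\\{}".format(path, doc))
    for col in df.columns:
        df = df[df[col] != "-"]   # bracket indexing, not df.col
    dfs[doc] = df

That said, the filtering itself doesn't need a column loop at all.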

Option 1
For your case, eq + any should suffice: flag the rows in which any value equals '-', then negate to keep the rest.

df = df[~df.astype(str).eq('-').any(1)]                # `astype` conversion

Or,

df = df[~df.select_dtypes(['object']).eq('-').any(1)]  # `select_dtypes`, thanks MaxU!
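As a quick illustration on a toy frame (hypothetical data):

import pandas as pd

df = pd.DataFrame({'a': ['1', '-', '3'], 'b': ['x', 'y', '-']})
df = df[~df.astype(str).eq('-').any(1)]
print(df)   # only the first row survives; the other two each contain '-'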

Option 2
Another option is to pass the na_values argument to read_csv, so that these values are converted to NaN as your data is read in; you can then drop them.

df = pd.read_csv('file.csv', na_values=['-'])

And now, call dropna on your data -

df.dropna(inplace=True)
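Plugged into your loop, that looks something like this (a sketch under the same fList/path assumptions as above):

dfs = {}
for doc in fList:
    df = pd.read_csv("{}\\{}".format(path, doc), na_values=['-'])
    df.dropna(inplace=True)   # drops every row that contained '-' (or any other missing value)
    dfs[doc] = df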
cs95
  • @COLDSPEED tried your solution and got this: TypeError: Could not compare ['-'] with block values – geoJshaun Jan 15 '18 at 22:07
  • @ShaunO `df[~df.astype(str).eq('-').any(1)]` it seems like you don't have all string columns. – cs95 Jan 15 '18 at 22:09
  • @COLDSPEED that did it! Weird, I was sure they all were strings. Thank you so much! – geoJshaun Jan 15 '18 at 22:12
  • @ShaunO No problem, feel free to vote on, and accept the answer if it was helpful. I'd appreciate it ;-) – cs95 Jan 15 '18 at 22:13
  • @COLDSPEED Done! – geoJshaun Jan 15 '18 at 22:15
  • a little improvement: `df = df[~df.select_dtypes(['object']).eq('-').any(1)]` – MaxU - stand with Ukraine Jan 15 '18 at 22:21
  • @COLDSPEED Sorry. I'm actually getting some strange results with this. I want to delete rows with "-" in an effort to clean the data. All the columns need to be numeric to perform stats on them, and the "-" (the census data placeholder for NaN) is throwing it all off. When I run the .eq solution it changed one entire column to "-" and then randomly changed other column values in another table to "-". – geoJshaun Jan 15 '18 at 23:13
  • @ShaunO Wow, you should've said so. In that case, when calling `read_csv`, add a `na_values=['-']` argument, so those values are made null. Then, just call `dropna`. – cs95 Jan 15 '18 at 23:14
  • @COLDSPEED yes, a thousand apologies – geoJshaun Jan 16 '18 at 00:14