
I tried running some code I had developed previously, which has run fine many times using pandas.

My dataframe has a custom index (unique string values, each a unique identifier, in this case for individual proteins) and file names as the columns. I then use an iterative procedure to assign counts to some cells in the dataframe. So, let's say I have a default dictionary (my_dict) with arbitrary keys, where each value is [filename, protein, count].

I have a sorted list of filenames, and a sorted list of proteins, called all_filenames and all_proteins, respectively.

 import pandas as pd
 df = pd.DataFrame(index=all_proteins, columns=all_filenames)

 from collections import defaultdict
 my_dict = defaultdict(list)

 ... (Assign values to the dictionary)

 for key in my_dict:
     my_filename = my_dict[key][0]
     my_protein = my_dict[key][1]
     my_count = my_dict[key][2]

     df[my_filename][my_protein] = my_count

However, whenever I print df in this case, it comes back entirely blank (with the proper index and filenames), which doesn't normally happen.

So to test, I did the following on the dataframe:

>>> my_filename in df.columns.tolist()
True
>>> my_protein in df.index.tolist()
True
>>> df[my_filename][my_protein]
nan
>>> my_count
3.0
>>> type(my_count)
<type 'numpy.float64'>
>>> 
>>> df[my_filename][my_protein] = my_count
>>> df[my_filename][my_protein]
nan
>>> 

I've tried df[my_filename].ix[my_protein], df[my_filename].loc[my_protein], and even creating a custom index.

Normally this script works fine. My file names are typically something like beta_maxi070214_08, so no spaces or non-ASCII characters.

My protein names are all standard: each name is either in the UniProtKB database or is a linkage between two proteins (e.g., ACACA-ACACB).

I'm not really sure what's going on. Does anyone have any suggestions?

EDIT: Here is an example:

>>> my_filename 
'beta_orbi080714_05'
>>> my_protein 
'ACACA:K1316-ACACA:K1363'
>>> my_count 
3.0 
>>> type(my_count) 
<type 'numpy.float64'>
>>> df[my_filename][my_protein] = my_count
>>> df[my_filename][my_protein]
nan
>>> 
Alex Huszagh
  • What is my_column? Is this pseudocode, or the exact code you're running? Where is my_column defined? – Parker Oct 22 '14 at 04:45
  • So I basically import a list of files, and extract the filenames from the files. In this case, I tested it with a file I knew was in the list. For example, 'beta_maxi070214_08' is a string and a filename, and is a component of the list all_filenames (and also in the column). – Alex Huszagh Oct 22 '14 at 04:47
  • You didn't answer, what is my_column? Where is it defined? – Parker Oct 22 '14 at 04:48
  • Sorry, my bad, I'll make an edit. I just noticed (I'm using data that may not make any sense, and my boss wouldn't be happy if I posted it online, so I'll quickly touch this up). – Alex Huszagh Oct 22 '14 at 04:48
  • Can you at least post the values of `my_filename` `my_protein` for a case that's giving you NaN? – Parker Oct 22 '14 at 04:50
  • Yeah, so here's the exact example I show: >>> my_filename 'beta_orbi080714_05' >>> my_protein 'ACACA:K1316-ACACA:K1363' >>> my_count 3.0 >>> type(my_count) – Alex Huszagh Oct 22 '14 at 04:53
  • What version of pandas is this – Parker Oct 22 '14 at 05:01
  • Parker, that actually solved my issue. Can I give you the answer? – Alex Huszagh Oct 22 '14 at 05:01
  • Sure thing, just undeleted :) was about to post as a new answer – Parker Oct 22 '14 at 05:02

1 Answer

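Try: df.ix[my_protein, my_filename] = my_count (row label first, then column label, since the proteins are the index and the filenames are the columns)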

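The reason for this (from my understanding) is that df['x']['y'] = value is chained indexing: df['x'] is evaluated first and may return a copy of that column rather than a view into the data frame. So you ARE changing a value, but you're changing it on a temporary copy that is never placed back into df.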

Edit: As DSM notes, .loc and .iloc are generally preferred to .ix, which has hard-to-explain semantics. (.ix was later deprecated and then removed in pandas 1.0, so on current versions write df.loc[my_protein, my_filename] = my_count.) There is also a section of the docs devoted to explaining the view vs. copy issues involved: http://pandas.pydata.org/pandas-docs/stable/indexing.html#returning-a-view-versus-a-copy
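
For reference, here is a minimal sketch of the loop from the question with the chained df[my_filename][my_protein] assignment replaced by a single .loc call. The two filenames, the two protein names and the key "example_key" are made-up stand-ins for the asker's real data (which was not posted), so treat this as an illustration under those assumptions rather than the asker's actual script.

```python
import pandas as pd
from collections import defaultdict

# Hypothetical stand-ins for the asker's sorted lists (real data not posted).
all_filenames = ["beta_maxi070214_08", "beta_orbi080714_05"]
all_proteins = ["ACACA-ACACB", "ACACA:K1316-ACACA:K1363"]

# Proteins are the row index, filenames are the columns; all cells start as NaN.
df = pd.DataFrame(index=all_proteins, columns=all_filenames)

# Each value is [filename, protein, count], as described in the question.
my_dict = defaultdict(list)
my_dict["example_key"] = ["beta_orbi080714_05", "ACACA:K1316-ACACA:K1363", 3.0]

for my_filename, my_protein, my_count in my_dict.values():
    # A single indexing call on df itself: row label first, then column label.
    # This writes into df rather than into an intermediate object.
    df.loc[my_protein, my_filename] = my_count

print(df.loc["ACACA:K1316-ACACA:K1363", "beta_orbi080714_05"])
```

Because .loc receives the row and column labels together, pandas performs one set operation on df instead of a "get" followed by a "set" on whatever the "get" returned. For single-cell writes like this, df.at[my_protein, my_filename] = my_count should behave the same way.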

Parker
  • You are correct, and it was correct in the body and I shoddily copied and pasted in (sorry, I tried to edit it before posting). This list is generated by a parser and then appended to a list (generated by the code I use). I've printed the list to file while debugging and it works well. – Alex Huszagh Oct 22 '14 at 04:57
  • Thank you so much! It actually works now. I'm still confused as to why, but at least now I can have a functioning code more generally. – Alex Huszagh Oct 22 '14 at 05:03
  • Minor: these days, `.loc` and `.iloc` are generally preferred to `.ix`, which has hard-to-explain semantics. And there's a section of the docs [here](http://pandas.pydata.org/pandas-docs/stable/indexing.html#returning-a-view-versus-a-copy) devoted to explaining the view vs. copy issues involved. – DSM Oct 22 '14 at 05:08
  • Noted, thanks. I have little experience with pandas, and didn't realize this. Will take a look – Parker Oct 22 '14 at 05:10
  • Perfect, thanks DSM. I've heard that .ix is much slower compared to .loc and straight returning a copy, which is why I was avoiding it before. It seems like the documents you show demonstrate that .loc would be much faster than chained [], so that should help me speed up my code as well. Thanks ^_^ – Alex Huszagh Oct 22 '14 at 05:21