
Using read_csv in Pandas, I have imported a huge data set with >500K rows; each row contains a taxonomic code with location and abundance values from a specific date and station. These values repeat for different stations over time. I cannot create a unique timestamp because the time of day was not recorded, so I only have the date.

My columns are: Cruise Name, Station Number, Latitude, Longitude, Date (YY/MM/DD), Taxonomic Code, Abundance

I need to rearrange the data so that the columns are the individual taxonomic codes (n > 400), with abundance as the values in those columns, and the rows are occurrences with a unique index built from the location and date information. To further complicate this, I need to include zeros where there were no observations of a taxonomic code in a particular sample.


edit: I created a new data frame with a unique identifier including all pertinent location information in a single column, the abundance, and taxonomic ID using:
df['ID'] = df[['timestamp','cruise','station','lat','long','depth']].apply(lambda x: ','.join(map(str, x)), axis=1)
df3 = pd.DataFrame([df.ID, df.TaxonomicCode, df.Abundance]).T

    ID  oldtxc  zoodns100
0   1977-02-13 00:00:00,MM7701,2,41.1833,-70.6667,...   101 114.95
1   1977-02-13 00:00:00,MM7701,2,41.1833,-70.6667,...   102 40118.18

define variables and count using numpy unique:
species = df3['TaxonomicCode']
cruise = df3['ID']
taxa = np.unique(species)    # 419 unique taxa
locats = np.unique(cruise)   # 27530 unique location IDs

I then created another data frame filled with zeros using:
aa = pd.DataFrame(index=locats, columns=taxa)  # create empty matrix
aa = aa.fillna(0)                              # fill NaN with 0

2   100 101 102 103 104 105 106 107 108 ... 4500    4504    4601    4604    4700    5000    5100    5101    5150    9114
1977-02-13 00:00:00,MM7701,2,41.1833,-70.6667,33.0  0   0   0   0   0   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0

Now I want to loop through the raw data to fill in the values of 'aa':

for d in range(len(df3)):
    loc1 = df3.ID[d]
    tax1 = df3.oldtxc[d]
    locIndex = np.where(locats == loc1)[0][0]
    taxIndex = np.where(taxa == tax1)[0][0]
    aa[locIndex, taxIndex] = df3.zoodns100[d]

This works initially and gives the expected output, but it crashes my computer after only ~1000 iterations (with 509K iterations left to go... gulp). I am using the IPython notebook and/or Spyder and the result is the same. Could this be a memory issue?

Ultimately I need to write the resulting dataframe 'aa' to a CSV file; is there a way to do this during the iteration to free memory?


edit 2

Now I see that my loop is causing a vertical concatenation of columns onto my dataframe 'aa', which could explain the crashing. What I am attempting to do is fill in the values of 'aa' (which is, and ultimately should be, 419 rows x 27530 columns) from a single column in dataframe 'df3', which is 510K rows long and contains the abundance values for each observed species at each location. I need to match the abundance data from each location to the respective columns of 'aa' for each row of 'aa'. Each row of 'aa' is a unique location index, created using 'np.unique' on the location values in 'df3' (510K -> 27K unique stations).

I am trying to iterate over the length of 'df3', find the matching location indices (rows) in 'df3', and pair the abundance data associated with those indices to the corresponding single row in 'aa'.

(clear as mud, right? this is hard to explain...)

I am sure it is my syntax, but I want to assign values to dataframe 'aa' with aa[row, column] = value; this appears to be causing a concatenation rather than assigning values. What am I missing here?


1 Answer


I can't understand your entire question, but I have a couple pointers that might help.

First, I don't think there's any reason for your statement:

aa=aa.fillna(0)

There's no benefit to preallocating all of those zeros, and it clutters your code.

I think instead it's more efficient for you to have something along the lines of:

aa = pd.DataFrame(index=locats, columns=taxa)
for d in range(len(df3)):
    ...
    # build a Series (new_row) for just one row
    ...
    aa = aa.append(new_row, ignore_index=True)  # True/False depending on what you want

Also, you might want to reconsider your for loop. Pandas has an iterrows() function that you could use instead. You use it like this:

for row_index, row in df3.iterrows(): 

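Combined with label-based assignment, that loop could then look something like this (a minimal sketch; it assumes df3 has the ID, oldtxc and zoodns100 columns shown in the question, and that aa keeps the unique location IDs as its index and the taxa codes as its columns):

for row_index, row in df3.iterrows():
    # look up by label instead of np.where; this writes into the existing cell
    aa.loc[row['ID'], row['oldtxc']] = row['zoodns100']
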
When it comes to concatenating, you may introduce new performance issues. There's a post here that discusses your options, but those discussions are about millions of rows and yours are far fewer, so I think there's hope.

Along those lines, don't feel obligated to solve the entire problem in one iteration. That's another reason not to allocate everything in advance. If you have genuine performance issues, it might be possible to break the work into chunks. For example, every 1000 rows iterated, you could flush your current DataFrame to a .csv file, thereby releasing that memory. You might end up with 500 .csv files, but a separate function could then read them all back in. Assuming they are the only .csv files in the directory:

import os
import pandas as pd

def concatenate_files(files_path):
    file_list = []
    for file_ in os.listdir(files_path):
        if file_.endswith('.csv'):
            file_list.append(files_path + '/' + file_)
    combined_df = pd.DataFrame()
    for file_name in file_list:
        df = pd.read_csv(file_name)
        combined_df = combined_df.append(df, ignore_index=False)
    return combined_df
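
The flushing side of that idea isn't shown above; a minimal sketch of it (assuming the loop walks df3 with iterrows, and using placeholder file names) might be:

import pandas as pd

chunk = []                                   # rows accumulated since the last flush
for d, (row_index, row) in enumerate(df3.iterrows()):
    chunk.append(row)                        # collect this row (or whatever partial result you build)
    if (d + 1) % 1000 == 0:                  # every 1000 rows, flush to disk
        pd.DataFrame(chunk).to_csv('chunk_%06d.csv' % (d + 1))
        chunk = []                           # drop the reference so the memory can be released
if chunk:                                    # write out the final, shorter chunk
    pd.DataFrame(chunk).to_csv('chunk_final.csv')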

Hope that helps.

UPDATE 8/20 in response to your 'edit 2'

Your problem in your most recent post is that if 'row' and 'column' are integers, then you are trying to use integer location indexing but not calling the correct function (iloc). That causes columns to be appended. Try running this example code to see what I mean.

import numpy as np
from pandas import DataFrame

df = DataFrame(np.random.randn(4, 4))
df[1, 2] = 3       # not what you want: appends a new column named (1, 2)
print df
df.iloc[1, 2] = 3  # what I think you mean: sets the value at row 1, column 2
print df

Again, this goes back to my original suggestion though. I don't think it's in your best interest to allocate the 419x27530 up front. I think some of your problems come from your insistence on trying to fit things that way. Besides the preallocation, you mention that your data orientation is a problem, but I'm not clear on exactly how that is. It's perfectly valid to build your results as 27530x1, 27530x2 ... 27530x419 and then call DataFrame.transpose (df.T) to get the 419x27530 orientation you want.
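
As a rough sketch of that build-then-transpose idea (assuming each location/taxon pair occurs at most once in df3, and reusing the df3, taxa and locats names from the question):

import pandas as pd

# Build one abundance Series per taxon, each aligned to the 27530 unique
# locations, then assemble and transpose once at the end.
cols = {}
for tax in taxa:
    obs = df3[df3['oldtxc'] == tax]                      # observations of this taxon only
    cols[tax] = obs.set_index('ID')['zoodns100']         # abundance indexed by location ID
result = pd.DataFrame(cols, index=locats).fillna(0)      # 27530 x 419, zeros where unobserved
result = result.T                                        # 419 x 27530 if that orientation is wanted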

  • Thank you for your help, but I am still struggling to figure out how to tackle this problem. I like your idea of splitting the results up however - I just realized I have a problem with the original list of unique locations and that my code writes to 'aa' not from top to bottom, but all over the place because the sorting is wrong on my unique IDs ('year', 'month', 'day',... but it is sorted by year, then the first number of the 'month' so October comes before January, which writes into row 173 in 'aa' as the first entry). – Ryan Aug 19 '14 at 21:42
  • Those dates are easily sortable. If you think about it, sorting dates is a very common problem. No need for you to reinvent the wheel. Is the 'dtype' for your timestamp column not numpy.datetime64? If it was, then sorting would work correctly. One way would be to reparse the data, as explained [here](http://stackoverflow.com/questions/17134716/convert-dataframe-column-type-from-string-to-datetime); a faster way, if you are confident in the data, would be to simply change the dtype using [DataFrame.astype](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.astype.html) (see the sketch after these comments) – Jeff Ellen Aug 20 '14 at 14:37
  • the iloc was the trick, silly me. My original loop now works in about 10 minutes or less. I don't see any other way of doing it, unfortunately. When I create the empty dataframe 'aa' it is already filled with NaN, so I just replace those with zeros, which I need for later analysis anyway. I can't see that having a huge effect on the time. Anyway, one hurdle down, many more to come. Thanks for your input and help. – Ryan Aug 28 '14 at 20:43
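
A minimal sketch of the dtype fix suggested in that comment (assuming the raw date column is the 'timestamp' column used to build the ID above):

import pandas as pd

# Parse the date strings so that sorting is chronological rather than lexical;
# pass an explicit format= to pd.to_datetime if the YY/MM/DD strings are ambiguous.
df['timestamp'] = pd.to_datetime(df['timestamp'])
df = df.sort_values('timestamp')   # October now sorts after January within the same year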