Using read_csv in pandas, I have imported a huge data set of >500K rows; each row contains a taxonomic code plus location and abundance values for a specific date and station. These values repeat across stations over time. I cannot create a unique timestamp because the time of day was not recorded, so I only have the date.
My columns are: Cruise Name, Station Number, Latitude, Longitude, Date (YY/MM/DD), Taxonomic Code, Abundance.
I need to rearrange the data so that the individual taxonomic codes (n > 400) become the column names, with abundance as the values in those columns, and each row is a sampling occurrence with a unique index built from the location and date information. To further complicate things, I need to fill in zeros wherever a taxonomic code was not observed in a particular sample.
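To make the target concrete, here is a toy version of the reshape I am after (made-up locations, codes, and values):

import pandas as pd

toy = pd.DataFrame({'location':  ['loc1', 'loc1', 'loc2'],
                    'taxon':     [101, 102, 101],
                    'abundance': [114.95, 40118.18, 7.5]})
# Desired result: one row per location, one column per taxon,
# and a zero where a taxon was never observed at that location:
#
#          101       102
# loc1  114.95  40118.18
# loc2    7.50      0.00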
edit: I created a new data frame with a unique identifier (all pertinent location information joined into a single column), the taxonomic ID, and the abundance, using:
import pandas as pd
import numpy as np

df['ID'] = df[['timestamp','cruise','station','lat','long','depth']].apply(lambda x: ','.join(map(str, x)), axis=1)
df3 = pd.DataFrame([df.ID, df.oldtxc, df.zoodns100]).T
ID oldtxc zoodns100
0 1977-02-13 00:00:00,MM7701,2,41.1833,-70.6667,... 101 114.95
1 1977-02-13 00:00:00,MM7701,2,41.1833,-70.6667,... 102 40118.18
Define variables and count the unique values with numpy's unique:
species = df3['oldtxc']
cruise = df3['ID']
taxa = np.unique(species)      # 419 unique taxonomic codes
locats = np.unique(cruise)     # 27530 unique location strings
I then created another data frame filled with zeros using:
aa = pd.DataFrame(index=locats, columns=taxa)   # create empty matrix: locations x taxa
aa = aa.fillna(0)                               # fill NaN with 0
2 100 101 102 103 104 105 106 107 108 ... 4500 4504 4601 4604 4700 5000 5100 5101 5150 9114
1977-02-13 00:00:00,MM7701,2,41.1833,-70.6667,33.0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
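As a sanity check on the orientation (this matters below), the rows should be the 27,530 unique location strings and the columns the 419 taxa:

aa.shape
# (27530, 419)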
Now I want to loop through the raw data to fill in the values of 'aa':
for d in range(len(df3)):
    loc1 = df3.ID[d]
    tax1 = df3.oldtxc[d]
    locIndex = np.where(locats == loc1)[0][0]   # row position of this location in 'aa'
    taxIndex = np.where(taxa == tax1)[0][0]     # column position of this taxon in 'aa'
    aa[locIndex, taxIndex] = df3.zoodns100[d]
This works initially and gives the expected output, but it crashes my computer after only ~1,000 iterations (with 509K iterations left to go... gulp). I am using the IPython notebook and/or Spyder, and the result is the same in both. Could this be a memory issue?
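Separate from the crash, I suspect the two np.where scans per iteration (a linear search through 27,530 strings for every one of the 510K rows) are also painfully slow; a sketch of a one-time position lookup I am considering as a replacement:

# Map each label to its integer position once, instead of re-scanning with np.where per row.
loc_pos = {loc: i for i, loc in enumerate(locats)}   # location string -> row position in 'aa'
tax_pos = {tax: j for j, tax in enumerate(taxa)}     # taxonomic code -> column position in 'aa'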
Ultimately I need to write the resulting dataframe 'aa' to a CSV file. Is there a way to do this during the iteration to free memory?
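For the export itself, I assume a single to_csv call is enough once 'aa' is filled (the file name is a placeholder):

aa.to_csv('abundance_matrix.csv', index_label='ID')   # one-shot dump of the finished matrix

# Or, if memory is tight, append a slice of rows at a time
# (header only on the first slice; 1000 rows is an arbitrary chunk size):
for start in range(0, len(aa), 1000):
    aa.iloc[start:start + 1000].to_csv('abundance_matrix.csv',
                                       mode='a', header=(start == 0))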
edit 2
Now I see that my loop is concatenating new columns onto dataframe 'aa' on every assignment, which could explain the crashing. What I am attempting to do is fill in the values of 'aa' (which is, and should remain, 27,530 rows x 419 columns) from a single column of dataframe 'df3', which is 510K rows long and contains the abundance value for each observed species at each location. Each row of 'aa' is a unique location index, created by running np.unique on the location values in 'df3' (510K rows -> 27,530 unique stations); I need to match the abundance data from each location into the respective taxon columns of the corresponding row of 'aa'.
I am iterating over the length of 'df3', trying to find the matching location index (row) in 'aa' for each record and write the abundance value from that record into the corresponding single row of 'aa'.
(Clear as mud, right? This is hard to explain...)
I am sure it is my syntax. I want to assign values to dataframe 'aa' like so:
aa[row, column] = value
This appears to cause a concatenation rather than an assignment of values. What am I missing here?
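My current reading is that aa[locIndex, taxIndex] is interpreted as aa[(locIndex, taxIndex)], i.e. a request for a brand-new column labeled with that tuple, which would explain the frame growing on every assignment. If that is right, scalar assignment presumably needs an explicit indexer, something like:

aa.iat[locIndex, taxIndex] = df3.zoodns100[d]   # by integer position
aa.at[loc1, tax1] = df3.zoodns100[d]            # by row/column label

And in case the loop is the wrong tool altogether: would a single vectorized reshape do the same job? A sketch (untested on the full 510K rows), assuming duplicate ID/taxon pairs should be summed:

df3['zoodns100'] = df3['zoodns100'].astype(float)   # the transposed frame holds object dtype
wide = df3.pivot_table(index='ID', columns='oldtxc',
                       values='zoodns100', aggfunc='sum', fill_value=0)
wide.to_csv('abundance_matrix.csv')                 # placeholder file name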