Goal
My goal is to merge two DataFrames by their common column (gene names) so I can take a product of each gene score across each gene row. I'd then perform a groupby
on patients and cells and sum all scores from each. The ultimate data frame should look like this:
patient cell
Pat_1 22RV1 12
DU145 15
LN18 9
Pat_2 22RV1 12
DU145 15
LN18 9
Pat_3 22RV1 12
DU145 15
LN18 9
That last part should work fine, but I have not been able to perform the first merge on gene names due to a MemoryError
. Below are snippets of each DataFrame.
Data
cell_s =
Description Name level_2 0
0 LOC100009676 100009676_at LN18_CENTRAL_NERVOUS_SYSTEM 1
1 LOC100009676 100009676_at 22RV1_PROSTATE 2
2 LOC100009676 100009676_at DU145_PROSTATE 3
3 AKT3 10000_at LN18_CENTRAL_NERVOUS_SYSTEM 4
4 AKT3 10000_at 22RV1_PROSTATE 5
5 AKT3 10000_at DU145_PROSTATE 6
6 MED6 10001_at LN18_CENTRAL_NERVOUS_SYSTEM 7
7 MED6 10001_at 22RV1_PROSTATE 8
8 MED6 10001_at DU145_PROSTATE 9
cell_s is about 10,000,000 rows
patient_s =
id level_1 0
0 MED6 Pat_1 1
1 MED6 Pat_2 1
2 MED6 Pat_3 1
3 LOC100009676 Pat_1 2
4 LOC100009676 Pat_2 2
5 LOC100009676 Pat_3 2
6 ABCD Pat_1 3
7 ABCD Pat_2 3
8 ABCD Pat_3 3
....
patient_s is about 1,200,000 rows
Code
def get_score(cell, patient):
cell_s = cell.set_index(['Description', 'Name']).stack().reset_index()
cell_s.columns = ['Description', 'Name', 'cell', 's1']
patient_s = patient.set_index('id').stack().reset_index()
patient_s.columns = ['id', 'patient', 's2']
# fails here:
merged = cell_s.merge(patient_s, left_on='Description', right_on='id')
merged['score'] = merged.s1 * merged.s2
scores = merged.groupby(['patient','cell'])['score'].sum()
return scores
I was getting a MemoryError when initially read_csv
ing these files, but then specifying the dtypes resolved the issue. Confirming that my python is 64 bit did not fix my issue either. I haven't reached the limitations on pandas, have I?
Python 3.4.3 |Anaconda 2.3.0 (64-bit)| Pandas 0.16.2