
Goal

My goal is to merge two DataFrames on their common column (gene names) so I can take the product of the two scores in each gene row. I'd then group by patient and cell and sum all the scores in each group. The final DataFrame should look like this:

    patient  cell 
    Pat_1    22RV1    12
             DU145    15
             LN18      9
    Pat_2    22RV1    12
             DU145    15
             LN18      9
    Pat_3    22RV1    12
             DU145    15
             LN18      9

That last part should work fine, but I have not been able to perform the first merge on gene names due to a MemoryError. Below are snippets of each DataFrame.

Data

cell_s =

    Description          Name                      level_2  0
0  LOC100009676  100009676_at  LN18_CENTRAL_NERVOUS_SYSTEM  1
1  LOC100009676  100009676_at               22RV1_PROSTATE  2
2  LOC100009676  100009676_at               DU145_PROSTATE  3
3          AKT3      10000_at  LN18_CENTRAL_NERVOUS_SYSTEM  4
4          AKT3      10000_at               22RV1_PROSTATE  5
5          AKT3      10000_at               DU145_PROSTATE  6
6          MED6      10001_at  LN18_CENTRAL_NERVOUS_SYSTEM  7
7          MED6      10001_at               22RV1_PROSTATE  8
8          MED6      10001_at               DU145_PROSTATE  9

cell_s has about 10,000,000 rows

patient_s =

             id level_1  0
0          MED6   Pat_1  1
1          MED6   Pat_2  1
2          MED6   Pat_3  1
3  LOC100009676   Pat_1  2
4  LOC100009676   Pat_2  2
5  LOC100009676   Pat_3  2
6          ABCD   Pat_1  3
7          ABCD   Pat_2  3
8          ABCD   Pat_3  3
    ....

patient_s has about 1,200,000 rows

Code

def get_score(cell, patient):
    # reshape cell scores to long format: one row per (Description, Name, cell)
    cell_s = cell.set_index(['Description', 'Name']).stack().reset_index()
    cell_s.columns = ['Description', 'Name', 'cell', 's1']

    # reshape patient scores to long format: one row per (id, patient)
    patient_s = patient.set_index('id').stack().reset_index()
    patient_s.columns = ['id', 'patient', 's2']

    # fails here with a MemoryError:
    merged = cell_s.merge(patient_s, left_on='Description', right_on='id')
    merged['score'] = merged.s1 * merged.s2

    scores = merged.groupby(['patient', 'cell'])['score'].sum()
    return scores

I was getting a MemoryError when initially reading these files with read_csv, but specifying the dtypes resolved that issue. Confirming that my Python is 64-bit did not fix the merge problem either. I haven't reached the limits of pandas, have I?

Python 3.4.3 |Anaconda 2.3.0 (64-bit)| Pandas 0.16.2

Thomas Matthew
  • You might save quite a bit of memory by converting strings to categoricals if you haven't already. I don't believe 64-bit pandas will have any memory limitations other than what your computer has. – JohnE Aug 01 '15 at 19:25
  • When importing with `read_csv`, if I set `dtype='category'` for all columns containing strings, I get `TypeError: data type "category" not understood`. I guess that's because [Categorical is not a numpy array](http://pandas-docs.github.io/pandas-docs-travis/categorical.html#categorical-is-not-a-numpy-array). Can you think of a workaround? Can I simply ignore the MemoryError? – Thomas Matthew Aug 01 '15 at 19:51
  • I think you have to convert after reading in. I usually do something like `df['x'] = df['x'].astype('category')` – JohnE Aug 01 '15 at 20:55
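
For reference, here is a minimal sketch of the conversion suggested in these comments, assuming the column names from the question and hypothetical file names; since passing `dtype='category'` to `read_csv` raises the TypeError mentioned above, the cast happens after loading:

    import pandas as pd

    # hypothetical file names; adjust to the real paths
    cell = pd.read_csv('cells.csv')
    patient = pd.read_csv('patients.csv')

    # convert the string columns to category after reading, which can shrink
    # memory use considerably for repeated gene/cell labels
    for col in ['Description', 'Name']:
        cell[col] = cell[col].astype('category')
    patient['id'] = patient['id'].astype('category')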

2 Answers


Consider two workarounds:

CSV BY CHUNKS

Apparently, read_csv can suffer performance issues with very large files, so they should be loaded in iterated chunks.

import pandas as pd

cellsfilepath = 'C:\\Path\\To\\Cells\\CSVFile.csv'
tp = pd.read_csv(cellsfilepath, sep=',', iterator=True, chunksize=1000)
cell_s = pd.concat(tp, ignore_index=True)

patientsfilepath = 'C:\\Path\\To\\Patients\\CSVFile.csv'
tp = pd.read_csv(patientsfilepath, sep=',', iterator=True, chunksize=1000)
patient_s = pd.concat(tp, ignore_index=True)

CSV VIA SQL

As a database guy, I always recommend handling large data loads and merging/joining with a SQL relational engine, which scales well for such processes. I have written many a comment on DataFrame merge Q&As to this effect, even in R. You can use any SQL database, including file-server DBs (Access, SQLite) or client-server DBs (MySQL, MSSQL, or others), possibly even the one your DataFrames derive from. Python maintains a built-in library for SQLite (otherwise use ODBC), and DataFrames can be pushed into databases as tables with pandas' to_sql():

import sqlite3
import pandas as pd

dbfile = 'C:\\Path\\To\\SQlitedb.sqlite'
cxn = sqlite3.connect(dbfile)
c = cxn.cursor()   # optional here; to_sql() and read_sql() take the connection

cell_s.to_sql(name='cell_s', con=cxn, if_exists='replace')
patient_s.to_sql(name='patient_s', con=cxn, if_exists='replace')

strSQL = 'SELECT * FROM cell_s c INNER JOIN patient_s p ON c.Description = p.id;'
# MIGHT HAVE TO ADJUST ABOVE FOR CELL AND PATIENT PARAMS IN DEFINED FUNCTION

merged = pd.read_sql(strSQL, cxn)
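
From there, assuming the cell_s and patient_s pushed to SQLite are the reshaped frames from the question's get_score() (so the joined result carries s1, s2, patient, and cell columns), the scoring and aggregation should follow as in the question:

    # product of the two scores, then sum per (patient, cell), as in get_score()
    merged['score'] = merged['s1'] * merged['s2']
    scores = merged.groupby(['patient', 'cell'])['score'].sum()
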
Parfait
  • Thanks for the two workarounds. I started running `read_csv` in chunks, and it appears to reduce the memory usage (though I haven't measured exactly). The SQL solution looks like "the way", but when I run the above code, it throws an AttributeError, `'sqlite3.Cursor' object has no attribute 'cursor'`, at the `to_sql` call. I traced the issue to line 1467 of `pandas/io/sql.py`, in execute: `cur = self.con.cursor()`. I checked the docs and SO, and couldn't determine the cause of the error. Any ideas? – Thomas Matthew Aug 03 '15 at 02:04
  • EDIT: I eliminated the `c = conn.cursor()` following an example from [Sebastian Raschka](http://stackoverflow.com/questions/11622652/large-persistent-dataframe-in-pandas/28371706#28371706). Seems to work for my merge, but I [wouldn't be able to query](http://stackoverflow.com/questions/6318126/why-do-you-need-to-create-a-cursor-when-querying-a-sqlite-database) my new database, correct? – Thomas Matthew Aug 03 '15 at 02:35
  • First issue was due to me using the cursor object in the `con` argument of `to_sql()` and not the connection object. I made the edit with a new variable, `cxn`. Yes, you should be able to query the db using the [cursor's execute()](https://docs.python.org/2/library/sqlite3.html) with fetchall(), which outputs results in list format. But do note that in the above, the query results are passed into a pandas df using [read_sql()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql.html). – Parfait Aug 03 '15 at 02:54
  • It should not be necessary to do a workaround. Pandas should be able to handle this. Has anybody reported that issue to pandas? – Nickpick Jun 02 '16 at 09:03
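
For reference, a minimal sketch of querying the resulting SQLite file with a cursor, as described in Parfait's comment above; the path and table name follow the answer's example:

    import sqlite3

    cxn = sqlite3.connect('C:\\Path\\To\\SQlitedb.sqlite')
    c = cxn.cursor()
    c.execute('SELECT COUNT(*) FROM cell_s')
    print(c.fetchall())   # results come back as a list of tuples
    cxn.close()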

You may have to do it in pieces, or look into blaze. http://blaze.pydata.org
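
A minimal sketch of the "in pieces" idea, reusing the column names from the question's get_score(): merge one slice of patient_s at a time, aggregate, and combine the partial sums, so the full merged frame never has to fit in memory (the chunk size is arbitrary):

    import pandas as pd

    def get_score_chunked(cell_s, patient_s, chunksize=100000):
        partial_sums = []
        for start in range(0, len(patient_s), chunksize):
            chunk = patient_s.iloc[start:start + chunksize]
            merged = cell_s.merge(chunk, left_on='Description', right_on='id')
            merged['score'] = merged.s1 * merged.s2
            partial_sums.append(merged.groupby(['patient', 'cell'])['score'].sum())
        # combine the per-chunk group sums into the final totals
        return pd.concat(partial_sums).groupby(level=['patient', 'cell']).sum()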

Skorpeo