Can Pandas handle data frames that are larger than memory?

Asked Jan 26 '18 at 10:51

Active Jan 29 '18 at 08:03

Say I have a CSV file which is 20TB.

Is there a way to load that into a data frame on a machine with only 16GB of memory?

For example, what if I wanted to do:

import pandas as pd
data = pd.read_csv(csv_path)
data = data.drop_duplicates(subset=content_column_name)

edited Jan 29 '18 at 08:03

asked Jan 26 '18 at 10:51

Rahul Iyer

No, it won't work; you'll get an out-of-memory error. You should use `pytables` for this kind of operation; see https://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas – EdChum Jan 26 '18 at 10:53
Define "load". If you mean "load into memory", probably not. I can contrive a situation where it could work, though. Let's say you have a column of X million rows in which one very long string is repeated. If you read it in as a category, then it would only store single integer identifiers in that column, plus one copy of each distinct string. This is, of course, lossless. – jpp Jan 26 '18 at 10:54
Check out dask! – Ignacio Vergara Kausel Jan 26 '18 at 10:54
Are you sure it's 20 *TB*? – cs95 Jan 26 '18 at 10:56
@cᴏʟᴅsᴘᴇᴇᴅ Yup. News archives. – Rahul Iyer Jan 29 '18 at 08:02
@jp_data_analysis I have a CSV file of news articles with several columns (url, date, article content, author, etc.) – Rahul Iyer Jan 29 '18 at 08:04
And how's it stored? Obviously not on a 4TB disk. Some RAID pool? Hadoop or some database? – OneCricketeer Jan 29 '18 at 08:08
@cricket_007 No idea! I don't have access yet. I'm supposed to "figure everything out" first :) – Rahul Iyer Jan 29 '18 at 08:11
Well, no one is going to hand you a flash drive with it, is my point. And storing it in a database would hopefully make it 1) compressed 2) reasonably searchable – OneCricketeer Jan 29 '18 at 08:15
@cricket_007 They may not give me access. Instead, I would have to give them my code to execute. – Rahul Iyer Jan 29 '18 at 08:19
If it's MySQL or Postgres, they could give you a database dump, which you could load yourself... CSV is an awful format, is my main point. About the best it's good for is loading into Excel/Pandas, but then you'll just do regular SQL-like operations on it, so databases are generally better suited for processing – OneCricketeer Jan 29 '18 at 08:26
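
For what it's worth, the `drop_duplicates` step from the question can be approximated without ever holding the whole file in memory, by reading the CSV in chunks with `pd.read_csv(..., chunksize=...)` and remembering only a hash of each value already seen. The following is a minimal sketch, not a tested solution for 20TB: the function name `drop_duplicates_chunked`, the `out_path` argument and the default chunk size are made up for illustration, and it assumes the set of digests itself fits in RAM. With billions of articles that assumption may well fail, in which case you would have to partition the rows by hash on disk first, or use a database or dask as suggested above.

```python
import hashlib

import pandas as pd


def drop_duplicates_chunked(csv_source, out_path, column, chunksize=100_000):
    """Stream a CSV in chunks, keeping the first row for each distinct value of `column`.

    Hypothetical helper for illustration. Returns the number of distinct values written.
    """
    seen = set()  # hex digests of values already written; assumed to fit in memory
    first_chunk = True
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        # Hash each value so the set holds short digests rather than full article text.
        digests = chunk[column].astype(str).map(
            lambda text: hashlib.blake2b(text.encode("utf-8"), digest_size=16).hexdigest()
        )
        # Keep rows that are new overall and not repeated earlier within this chunk.
        keep = ~digests.isin(seen) & ~digests.duplicated()
        chunk[keep].to_csv(
            out_path,
            mode="w" if first_chunk else "a",  # overwrite once, then append
            header=first_chunk,                # write the header only once
            index=False,
        )
        seen.update(digests[keep])
        first_chunk = False
    return len(seen)
```

Usage would look something like `drop_duplicates_chunked(csv_path, "deduplicated.csv", content_column_name)`. Note that this keeps the first occurrence of each value, like the default `keep='first'` in `drop_duplicates`, and that comparing hashes instead of the full text carries a (very small) risk of treating two different articles as equal.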
