
Say I have a CSV file which is 20TB.

Is there a way to load that into a data frame on a machine with only 16GB of memory?

For example, what if I wanted to do:

import pandas as pd

data = pd.read_csv(csv_path)
data = data.drop_duplicates(subset=content_column_name)
Rahul Iyer
  • No, it won't work; you'll get an out-of-memory error. You should use `pytables` for this kind of operation; see https://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas – EdChum Jan 26 '18 at 10:53
  • Define "load". If you mean "load into memory", probably not. I can contrive a situation where it could work, though. Let's say you have a column of X million rows that all hold the same very long string. If you read it in as a category, then it would only store single integer identifiers in that column. This is, of course, lossless. – jpp Jan 26 '18 at 10:54
  • Are you sure it's 20 *TB*? – cs95 Jan 26 '18 at 10:56
  • @cᴏʟᴅsᴘᴇᴇᴅ Yup. News archives. – Rahul Iyer Jan 29 '18 at 08:02
  • @jp_data_analysis I have a CSV file of news articles with several columns (url, date, article content, author, etc.) – Rahul Iyer Jan 29 '18 at 08:04
  • And how's it stored? Obviously not on a 4TB disk. Some RAID pool? Hadoop or some database? – OneCricketeer Jan 29 '18 at 08:08
  • @cricket_007 No idea! I don't have access yet. I'm supposed to "figure everything out" first :) – Rahul Iyer Jan 29 '18 at 08:11
  • Well, no one is going to hand you a flash drive with it, is my point. And storing it in a database would hopefully make it 1) compressed 2) reasonably searchable – OneCricketeer Jan 29 '18 at 08:15
  • @cricket_007 They may not give me access. Instead, I would have to give them my code to execute. – Rahul Iyer Jan 29 '18 at 08:19
  • If it's MySQL or Postgres, they could give you a database dump, which you could load yourself... CSV is an awful format, is my main point. The most it's good for is loading into Excel/Pandas, but then you'll just do regular SQL-like operations on it, so databases are generally better suited for processing – OneCricketeer Jan 29 '18 at 08:26
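
To make the out-of-core idea from the comments concrete, here is a minimal sketch of the deduplication step that never holds the whole file in memory. It is not the `pytables` workflow from the linked question. It swaps in the standard-library `csv` and `sqlite3` modules, streams the file one row at a time, and keeps only a SHA-256 digest of each value seen so far in a SQLite table. The function name `drop_duplicates_streaming` and its parameters are hypothetical, and the sketch assumes a UTF-8 CSV with a header row. Whether the digest index fits on the available disk depends on how many distinct articles there are, which the question does not say.

```python
import csv
import hashlib
import sqlite3

# Article bodies can exceed the csv module's default field size limit.
csv.field_size_limit(2**31 - 1)


def drop_duplicates_streaming(src_path, dst_path, column, index_path=":memory:"):
    """Copy src_path to dst_path, keeping the first row for each distinct
    value of `column`. Returns the number of rows kept.

    Hypothetical helper: pass a file path as index_path so the index of
    seen digests lives on disk instead of in RAM.
    """
    seen = sqlite3.connect(index_path)
    seen.execute("CREATE TABLE IF NOT EXISTS seen (digest BLOB PRIMARY KEY)")
    kept = 0
    with open(src_path, newline="", encoding="utf-8") as src, \
            open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # Store a fixed-size digest rather than the full article text.
            value = row[column] or ""
            digest = hashlib.sha256(value.encode("utf-8")).digest()
            cur = seen.execute(
                "INSERT OR IGNORE INTO seen (digest) VALUES (?)", (digest,)
            )
            # rowcount is 1 when the digest was new, 0 when it was ignored.
            if cur.rowcount == 1:
                writer.writerow(row)
                kept += 1
    seen.commit()
    seen.close()
    return kept
```

Something like `drop_duplicates_streaming(csv_path, "deduped.csv", content_column_name, index_path="seen.sqlite")` would then write a deduplicated copy, which may or may not be small enough to load with pandas afterwards. The trade-off is speed: one index lookup per row is slow at this scale, which is part of why the comments steer towards a real database.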
