
I am trying to export a large dataset from SQL Server to my local hard disk for some data analysis. The file comes to about 30 GB, with over 6 million rows and about 10 columns.

This data will then be fed into Python pandas or Tableau for consumption. I suspect the sheer size of the file will give me poor performance during my analysis.

Any best practices to be shared for analyzing big-ish data on a local machine?

I am running an i7-4570 with 8 GB of RAM. I am hoping to be less reliant on SQL queries and to be able to run large analyses offline.

Due to the nature of the database, a fresh extract is needed each time and this process will have to repeat itself, so there will not be much appending happening.

I have explored HDFStores and also Tableau Data Extracts, but I am still curious whether I can get better performance by reading whole CSV files.

Is there a compression method of sorts that I might be missing? Again, the objective here is to run the analysis without constantly querying the server. The source itself (which I am optimizing) will refresh every morning, so when I get into the office I can just focus on getting coffee and doing some blazing fast analytics.
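
For reference, this is roughly how I have been experimenting with an HDFStore for the daily extract (a simplified sketch; the file names, chunk size, and column names are placeholders):

    import pandas as pd

    # Build a compressed HDF5 store from the daily CSV extract in chunks,
    # since the full 30 GB file will not fit in 8 GB of RAM at once.
    with pd.HDFStore("extract.h5", mode="w", complib="blosc", complevel=9) as store:
        for chunk in pd.read_csv("daily_extract.csv", chunksize=500000):
            store.append("data", chunk, data_columns=True)

    # Later, pull back only the rows needed instead of re-parsing the CSV.
    df = pd.read_hdf("extract.h5", "data", where="some_column > 100")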

  • Isn't this a dupe of this: http://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas?rq=1 – EdChum Dec 14 '15 at 11:42
  • I am not limited to exploring only Pandas; I am even open to having a small database on my machine if it is much faster. I just do not know the best practices. – BernardL Dec 14 '15 at 11:49
  • But then you make this question opinion-based, as everyone will have a different answer. The dupe link deals with how to use pandas with HDFStore effectively, which to me makes this question very similar IMO – EdChum Dec 14 '15 at 11:51
  • Take a look at http://stackoverflow.com/questions/31167571/how-to-load-large-table-into-tableau-for-data-visualization/31171086#31171086 – Alex Blakemore Dec 14 '15 at 15:10

1 Answer


With Tableau you would want to create a Tableau Data Extract from the CSV (it will be much faster to query than the CSV itself). That should be fine, since the extract sits on disk. However, as you mentioned, you will need to create a new extract each time the data changes.

With pandas I usually load everything into memory, but if it doesn't fit then you can read the CSV in chunks using the chunksize parameter (see this thread: How to read a 6 GB csv file with pandas).
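
A rough sketch of what chunked processing can look like (the file name, chunk size, and column names are placeholders):

    import pandas as pd

    # Iterate over the CSV in pieces instead of loading all 30 GB at once.
    chunks = pd.read_csv("extract.csv", chunksize=1000000)

    partials = []
    for chunk in chunks:
        # Aggregate each chunk, keeping only the small intermediate result.
        partials.append(chunk.groupby("some_key")["some_value"].sum())

    # Combine the per-chunk aggregates into the final answer.
    result = pd.concat(partials).groupby(level=0).sum()
    print(result.head())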
