9

I have a very large 3.5 GB CSV file that I'd like to be able to read, sort through and filter for results based on various inputs. I'm pretty sure I can just import it to a MySQL database and go from there, but is there any program or online tool available that involves simply uploading the CSV and the rest is automatic?

David michael
  • I've actually worked with a couple of online services that will parse CSV but with a file that large, it's hard to say what will work. I doubt any free, hackjob CSV parser will let you upload a 3.5GB file. :) My guess is getting it into some sort of database might be your best bet. For what it's worth, I have read that tools like UltraEdit and Notepad++ can actually work with incredibly large files, but not sure if that helps. – trnelson Nov 12 '13 at 18:08
  • You will probably benefit from reading the following: http://stackoverflow.com/questions/159521/text-editor-to-open-big-giant-huge-large-text-files It's about text files in general, but you may be able to get some good suggestions. Though, I have to say, given the "database" nature of CSV files, going the MySQL way does not sound like a bad idea. – James Nov 12 '13 at 18:36
  • Search for "mysql bulk load csv" for other options – rheitzman Nov 12 '13 at 19:10
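
For reference, a minimal sketch of what that bulk-load route might look like in MySQL (the table name, column names, and file path are placeholders, assuming a comma-delimited file with a header row):

-- hypothetical table whose columns mirror the CSV header:
CREATE TABLE my_csv (
  some_field VARCHAR(255),
  other_field INT
);

-- bulk load; LOCAL reads the file from the client machine and
-- requires local_infile to be enabled on both client and server:
LOAD DATA LOCAL INFILE '/path/to/my.csv'
INTO TABLE my_csv
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;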

10 Answers

8

Yes, there is.

You can use OpenRefine (or Google Refine). OpenRefine is like a spreadsheet on steroids.

The file size that you can manipulate depends on your computer's memory.

Estevão Lucas
5

Since it is a CSV file, you can open it with OpenRefine:

  1. Download OpenRefine from http://openrefine.org/download.html. It is open source.
  2. Unzip openrefine.zip.
  3. Run openrefine-2.7-rc.1\openrefine.exe.
  4. It is a web app, so open http://127.0.0.1:3333/ in Chrome.
  5. Upload the large CSV file. In my case the file was 3.61 GB and it opened successfully.

https://snag.gy/02WECq.jpg

Siddarth Kanted
3

You could try PostgreSQL 9.1+ and its file_fdw (File Foreign Data Wrapper), which presents the CSV file to the database as if it were a table. If you replace the CSV file with another file of the same name, you will see the new data in the database immediately.

You can improve performance by using a materialized view (PG 9.3+), which essentially creates a real database table from the CSV data. You could use pgAgent to refresh the materialized view on a schedule.
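
To make this concrete, a minimal sketch of the file_fdw plus materialized view setup might look like the following (the column list and file path are placeholders):

-- file_fdw exposes the CSV file as a foreign table:
CREATE EXTENSION IF NOT EXISTS file_fdw;
CREATE SERVER csv_files FOREIGN DATA WRAPPER file_fdw;

CREATE FOREIGN TABLE my_csv_live (
  some_field text
  -- ...one entry per CSV column...
) SERVER csv_files
  OPTIONS (filename '/srv/vendor-name/my.csv', format 'csv', header 'true');

-- snapshot the CSV into a real table so it can be indexed (PG 9.3+):
CREATE MATERIALIZED VIEW my_csv_snapshot AS
  SELECT * FROM my_csv_live;
CREATE INDEX ON my_csv_snapshot (some_field);

-- re-run (e.g. via pgAgent) whenever the CSV file changes:
REFRESH MATERIALIZED VIEW my_csv_snapshot;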

Another alternative would be to use the COPY statement:

/* the columns in this table are the same as the columns in your csv: */
create table if not exists my_csv (
  some_field text, ...
);

/* COPY appends, so truncate the table if loading fresh data again: */
truncate table my_csv;

/*
  You need to be a postgres superuser to use COPY.
  Use psql's \copy instead if you can't be a superuser.
  Put the csv file in /srv/vendor-name/.
*/

copy 
  my_csv 
from 
  '/srv/vendor-name/my.csv'
with (
  format csv
);
Neil McGuigan
2

I had the same problem with a CSV file of over 3 million lines. It could not be opened in OpenOffice Calc, Writer, or Notepad++.

Then I used OpenOffice 4 Base as a poor man's solution, since Base can link to a CSV file. A short description follows (the wording may not be exact, as I use the German OpenOffice).

  1. Prepare: your file needs a .csv extension, and the first line should contain the field names. Put the file as the only file in a new directory to avoid confusion; otherwise all files in the directory will be imported.
  2. File - New - Database. The assistant should come up.
  3. Connect to an existing database, format Text (your file needs to have the .csv extension).
  4. Next. Choose the path to the file (oddly, not the file itself). Choose CSV and the correct field delimiters.
  5. Next and Finish.
  6. Choose a name for your newly created database.

If everything is right, you will now see the table view with your newly created table.

You can also use gVim to view the file as in Notepad, e.g. to add the field-description line as the first line.

You can create queries on this table. As the table has no indexes, it is quite slow. Since OpenOffice does not show an hourglass cursor, it may seem as if the system has crashed.

Base is very limited and feels like an early beta. Creating new tables in that database is not possible (thus no INSERT query that selects from the text file).

Export to CSV is not possible. Reasonably sized query results can be copied and pasted into Calc, though this is time consuming.

Gunnar Bernstein
1

Sure, there are quite a few spreadsheet-like tools that support big data, IBM BigSheets being a major example.

For an online product with a free trial period, I'd recommend Datameer; I've had relatively good success with them.

RangerRick
1

CSV Explorer is an online tool to read, sort, and filter CSVs with millions of rows. Upload the CSV and it will automatically import it and let you start working with the data.

https://www.CSVExplorer.com

jastr
1

You could use Excel's built-in data connection to do this.

Original source: https://excel.officetuts.net/en/examples/open-large-csv

Steps:

  1. Create a new Excel file.
  2. Navigate to Data >> Get & Transform Data >> From File >> From Text/CSV and import the CSV file.
  3. After a while, you will get a window with a preview of the file.
  4. Click the little triangle next to the Load button.
  5. Select Load To…
  6. Now, we need to create a connection and add the data to the Data Model. This won't load the data into an Excel sheet, because a sheet is limited to about a million rows.
  7. Click OK. This will take a while to load.
  8. On the right side, you will see the name of the file and the number of rows. If you save the workbook, you will notice that its size has increased significantly.

  9. Double-click this area to open the Power Query Editor.
  10. Now, if you scroll down, you will notice that new rows are loaded as you scroll.
  11. To change a value, right-click a cell and select Replace Values.
  12. This will replace all "builders" with "roofers" under the "profession" column.
A W
0

I had a file with ~100 million records. I used the Linux command line to view the file (just to take a look):

$ more myBigFile.CSV

or

$ nano myBigFile.CSV

It worked with a 6 GB file.

0

If it's a flat .CSV file and it does not involve a data pipeline, I'm not exactly sure what you mean by "the rest is automatic".

For accessing larger .CSV files, the typical solutions are:

  1. Load your .CSV file into a SQL database such as MySQL or PostgreSQL.

You'll need to design a table schema, find a server to host the database, and write server-side code to maintain or change the database (a minimal sketch of this route follows after this list).

  2. Process your data using Python or R.

Running Python or R on GBs of data will put a lot of stress on your local computer. It's also better suited to data exploration and analytics than to table manipulation.

  3. Find a data hub for your data, for example Acho Studio.

A data hub is much easier, but its costs may vary. It does come with a GUI that helps you sort and filter through a table pretty easily.
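
As mentioned under option 1, here is a minimal sketch of that route: once the CSV is loaded into a table, the sorting and filtering the question asks about is plain SQL (the table and column names below are placeholders):

-- an index on the filtered column keeps queries fast on millions of rows:
CREATE INDEX idx_some_field ON my_csv (some_field);

-- filter and sort:
SELECT *
FROM my_csv
WHERE some_field = 'some value'
ORDER BY other_field DESC
LIMIT 100;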

Desmond830
0

You can try Acho. It's an online tool and provides a free trial as well. I recommend it because its interface looks pretty great and intuitive. Also, it has all features that you mentioned, including sorting or filtering values. Basically, I use it to shrink the size of the dataset and export it to Python to do further analysis.

Crystal L