
I have a CSV file that is 4GB in size. When I try to open it I get "there isn't enough memory", although I have 16GB of RAM and it is not all in use. I just need to get some random rows from the file to visualize. Is there a way I can do this?

SEC
  • Please use the correct, internationally agreed SI units. `GB`=gigabyte. `Gb`=gigabit. `gb`=? – Mark Setchell Sep 08 '21 at 07:37
  • Can you install anything useful, like WSL *"Windows Subsystem for Linux"* or `gawk` to make your Windows machine more useful? – Mark Setchell Sep 08 '21 at 07:39
  • You could probably take this answer https://stackoverflow.com/a/65841115/2836621 and as you get each line, generate a random number between 0..100 and if it is less than 5 print the line. That way you'll get around 5% of the lines from your original file in a new one. – Mark Setchell Sep 08 '21 at 07:47
  • 1
    I know that CSV files are as default associated with Excel on Windows because Microsoft thinks they should be. But please stop confusing the two. *Excel* files are `XLSX`, `XLS` and a few others. CSV files are usually TEXT and can be opened/ parsed by a number of tools, including stream readers. – MyICQ Sep 08 '21 at 09:50
  • Agree with @MarkSetchell, an alternative would be [AWK](https://stackoverflow.com/questions/7514896/select-random-3000-lines-from-a-file-with-awk-codes) – MyICQ Sep 08 '21 at 10:10
  • You can sample 10% of the lines in a file with `awk` by reading it one line at a time like this `awk 'rand()>0.9' YourFile` See example here https://stackoverflow.com/a/22303906/2836621 – Mark Setchell Sep 08 '21 at 10:14
  • RAM is a performance optimization. The limit you are running into is available address space. Presumably you have a 32-bit build of Excel installed which is limited to an address space of 4GB. – IInspectable Sep 08 '21 at 10:29
  • Regardless, there is no way to access a random row without parsing all rows up to it. This need not necessarily consume memory but will take time linear to the row requested. – IInspectable Sep 08 '21 at 10:34
  • **How many rows in the CSV file?** Have you tried to **IMPORT** it using either Power Query or the legacy import wizard? – Ron Rosenfeld Sep 08 '21 at 11:02
  • Just for fun, I made a 4GB CSV with 140,000,000 rows on my Mac. I sampled 1% of the rows at random in 13 seconds with `awk 'rand()>0.99' BigBoy.csv` and the peak memory resident set was 1.7MB. I imagine `awk` could run at similar speeds on Windows. – Mark Setchell Sep 08 '21 at 11:15
  • I just did a small test: I created a CSV file with five columns, each of them containing an integer value. The CSV file contains 1048576 rows, Excel's maximum. Nevertheless the whole file size is not even 12Mb. How do you manage having a 4Gb large CSV file which should contain less than Excel's maximum amount of rows? :-) – Dominique Sep 08 '21 at 11:21
  • @Dominique I don't know I created it using a python script. But now I can't read the file even using Python because it gives memory error. – SEC Sep 08 '21 at 11:30
  • If you have Python, why not just use Python to sample the data??? `with open('YourFile.csv') as f: for line in f: if randint(1,100) > 99: print(line)` – Mark Setchell Sep 08 '21 at 11:47
  • @MarkSetchell Thanks, that worked. I was trying to read the file into a pandas dataframe and Jupyter notebook kept crashing. But this way I was able to read the file. – SEC Sep 08 '21 at 13:03
  • Glad to hear it worked. I have put it as an answer so folks don't have to scrabble around in the comments looking for a solution. Good luck with your project. – Mark Setchell Sep 08 '21 at 13:47
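The filter-while-reading idea from the comments above (keep each line with some small probability, write the keepers to a new file) can be sketched as a short Python script. The function name, file names, and the 5% rate are placeholders, not from the thread:

```python
import random

def sample_to_file(src, dst, percent=5.0, seed=None):
    """Copy roughly `percent`% of lines from src to dst in one pass,
    using O(1) memory regardless of file size."""
    rng = random.Random(seed)
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            # each line is kept independently with probability percent/100
            if rng.uniform(0, 100) < percent:
                fout.write(line)
```

Because each line is decided independently, the output is only approximately `percent`% of the input; passing a fixed `seed` makes the sample reproducible.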

2 Answers


Apparently your file has grown so big that Excel (and, according to your comment, also Python/pandas) can't handle it anymore.

So I would suggest another approach: command-line tools.

  1. You could start with PowerShell, but batch processing there is quite difficult and the possibilities are limited.
  2. Another approach is the UNIX-like command-line tools. You can get those by installing either Cygwin (UNIX commands rewritten as Windows command-line programs) or WSL (Windows Subsystem for Linux), which lets you run a Linux environment on your PC.

On my PC, I have WSL, and I tried counting the number of entries in a 4GB CSV file: I got the answer within 2-3 seconds, without eating my RAM.

As for showing particular rows, this displays the 100,000th line counted from the end of the file:

tail -n 100000 test.csv | head -n 1

Time consumption: almost immediate.
RAM consumption: negligible.
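The same kind of lazy lookup can be done in Python without loading the whole file; here is a small sketch (the function name is mine, not from the answer) that returns the N-th line from the start:

```python
from itertools import islice

def nth_line(path, n):
    """Return the 1-based n-th line of a file, reading lazily."""
    with open(path) as f:
        # islice skips n-1 lines without buffering them;
        # None is returned if the file has fewer than n lines
        return next(islice(f, n - 1, n), None)
```

Like the `tail`/`head` pipeline, this streams the file line by line, so memory use stays negligible even for multi-gigabyte files.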

Dominique
  • Thank you, that works. I used shuf instead of tail and I think it works as I want it to, also works fast. But apparently I was mistaken as I could use Python to read the file. – SEC Sep 08 '21 at 13:05

As you appear to have Python available, you can surely use Python to read the file a line at a time like this:

from random import randint

with open('YourFile.csv') as f:
    for line in f:
        if randint(1, 100) > 99:
            print(line, end='')  # line already contains its trailing newline

As approximately 1% of random integers between 1 and 100 exceed 99, this samples about 1% of the lines. If you also want them in a random order, shuffle them after sampling, when you have much less data to shuffle.
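If you need an exact number of sampled rows rather than roughly 1%, a common single-pass variant is reservoir sampling. This is a sketch of my own, not part of the original answer:

```python
import random

def reservoir_sample(path, k, seed=None):
    """Return exactly min(k, total_lines) uniformly random lines in one pass."""
    rng = random.Random(seed)
    sample = []
    with open(path) as f:
        for i, line in enumerate(f):
            if i < k:
                sample.append(line)          # fill the reservoir first
            else:
                j = rng.randint(0, i)        # replace an entry with prob k/(i+1)
                if j < k:
                    sample[j] = line
    rng.shuffle(sample)  # reservoir order is not random, so shuffle at the end
    return sample
```

Memory use is bounded by the `k` retained lines, not by the file size, which matches the streaming approach above.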

Mark Setchell