
I have a 3 GB dataset with 40k rows and 60k columns that pandas is unable to read, and I would like to melt the file based on the current index.

The current file looks like this:

[screenshot of the file]

The first column is an index, and I would like to melt the whole file based on it. I tried pandas and dask, but both crash when reading the big file.

Do you have any suggestions? Thanks.

ᴀʀᴍᴀɴ
Alicia Pliego
  • `dask` has a built-in `melt` function that does exactly what `pandas` has implemented. – Fourier Jan 22 '20 at 13:44
  • @Fourier It's specifically stating that both pandas and dask crash when reading the file – ChatterOne Jan 22 '20 at 13:51
  • @ChatterOne That slipped my attention. Then more details are needed: is the `dtype` specified when reading in? Do you use chunking to process the data in chunks? – Fourier Jan 22 '20 at 14:00
  • When I load it with dask I get the following error: "Sample is not large enough to include at least one row of data. Please increase the number of bytes in `sample` in the call to `read_csv`/`read_table`". And when I load it in chunks, it crashes when I melt the file. – Alicia Pliego Jan 22 '20 at 14:02
  • @AliciaPliego what do you use as `sample` parameter for `dd.read_table` ? – Fourier Jan 22 '20 at 14:05
  • I used sample=100, and also 1000, 100000 and 1000000 – Alicia Pliego Jan 22 '20 at 14:07
  • @AliciaPliego `sample` is defined in bytes. With 60K+ columns try 4e6 to be on the safe side. And I suppose your data is all `int`? Try to set this parameter when reading as well. – Fourier Jan 22 '20 at 14:10
  • How do you deal with the partitions created? – Alicia Pliego Jan 22 '20 at 14:24
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/206452/discussion-between-fourier-and-alicia-pliego). – Fourier Jan 22 '20 at 14:29
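
For reference, a minimal sketch of the dask route discussed in the comments above, assuming a tab-separated file with no header row, a numeric first column used as the identifier, and hypothetical file names:

import dask.dataframe as dd

# sample is given in bytes; with 60k+ columns a single row is large, so use a
# few MB to make sure the sample contains at least one full row
df = dd.read_csv('bigfile.tsv', sep='\t', header=None,
                 sample=4_000_000, dtype='float32')

# melt using the first column (label 0 when header=None) as the identifier;
# adjust the dtype above if that column is not numeric
melted = df.melt(id_vars=[0])

# single_file=True merges the partitions into one output file
melted.to_csv('bigfile_melted.tsv', sep='\t', index=False, single_file=True)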

1 Answer


You need to use the chunksize argument of pandas' read_csv. See for example How to read a 6 GB csv file with pandas.

You will process N rows at a time, without loading the whole dataframe into memory. The best N depends on your computer: a small N costs less memory, but it increases the run time and the I/O load.

import pandas as pd

# create an object that reads your file 100 rows at a time
reader = pd.read_csv('bigfile.tsv', sep='\t', header=None, chunksize=100)
# process one chunk at a time
for chunk in reader:
    # melt each chunk, keeping the first column (label 0) as the identifier
    result = chunk.melt(id_vars=[0])
    # append the melted chunk to a new file
    result.to_csv('bigfile_melted.tsv', header=False, sep='\t', mode='a', index=False)

Furthermore, you can pass the argument dtype=np.int32 to read_csv if your data is integer, or dtype=np.float32 if you do not need full precision, to reduce memory usage and speed up processing.
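
For example, a minimal sketch (same hypothetical file name as above), assuming every column is numeric:

import numpy as np
import pandas as pd

# read 100 rows at a time, storing values as 32-bit floats to halve memory use
reader = pd.read_csv('bigfile.tsv', sep='\t', header=None,
                     chunksize=100, dtype=np.float32)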

NB: you can find examples of memory usage here: Using Chunksize in Pandas.

Michaël