
I have a 3 GB dataset with 40k rows and 60k columns that pandas is unable to read, and I would like to melt the file based on the current index.

The current file looks like this:

[screenshot of the file]

The first column is an index, and I would like to melt the whole file based on it. I tried pandas and dask, but both crash when reading the big file.

Do you have any suggestions? Thanks.

ᴀʀᴍᴀɴ
Alicia Pliego
  • `dask` has a built-in `melt` function that does exactly what `pandas` has implemented. – Fourier Jan 22 '20 at 13:44
  • @Fourier It's specifically stating that both pandas and dask crash when reading the file – ChatterOne Jan 22 '20 at 13:51
  • @ChatterOne That slipped my attention. Then more details are needed: is the `dtype` specified when reading in? Do you use chunking to process the data in chunks? – Fourier Jan 22 '20 at 14:00
  • When I load it with dask I get the following error: "Sample is not large enough to include at least one row of data. Please increase the number of bytes in `sample` in the call to `read_csv`/`read_table`". And when I load it in chunks, it crashes when I melt the file. – Alicia Pliego Jan 22 '20 at 14:02
  • @AliciaPliego what do you use as `sample` parameter for `dd.read_table` ? – Fourier Jan 22 '20 at 14:05
  • I used sample=100, and also 1000, 100000 and 1000000 – Alicia Pliego Jan 22 '20 at 14:07
  • @AliciaPliego `sample` is defined in bytes. With 60K+ columns try 4e6 to be on the safe side. And I suppose your data is all `int`? Try to set this parameter when reading as well. – Fourier Jan 22 '20 at 14:10
  • How do you deal with the partitions created? – Alicia Pliego Jan 22 '20 at 14:24
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/206452/discussion-between-fourier-and-alicia-pliego). – Fourier Jan 22 '20 at 14:29
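
For reference, a minimal sketch of the dask route discussed in the comments above, assuming a tab-separated file with no header row, a numeric first column used as the identifier, and hypothetical file names:

import dask.dataframe as dd

# sample is given in bytes; with 60k+ columns a single row is large, so use a
# few MB to make sure the sample contains at least one full row
df = dd.read_csv('bigfile.tsv', sep='\t', header=None,
                 sample=4_000_000, dtype='float32')

# melt using the first column (label 0 when header=None) as the identifier;
# adjust the dtype above if that column is not numeric
melted = df.melt(id_vars=[0])

# single_file=True merges the partitions into one output file
melted.to_csv('bigfile_melted.tsv', sep='\t', index=False, single_file=True)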

1 Answer


You need to use the chunksize argument of pandas' read_csv. See for example How to read a 6 GB csv file with pandas.

You will process N rows at a time, without loading the whole dataframe into memory. The best N depends on your computer: a small N costs less memory, but it increases the run time and the I/O load.

import pandas as pd

# create an object that reads your file 100 rows at a time
reader = pd.read_csv('bigfile.tsv', sep='\t', header=None, chunksize=100)
# process one chunk at a time
for chunk in reader:
    # melt each chunk, keeping the first column (label 0) as the identifier
    result = chunk.melt(id_vars=[0])
    # append the melted chunk to a new file
    result.to_csv('bigfile_melted.tsv', header=False, sep='\t', mode='a', index=False)

Furthermore, you can pass the argument dtype=np.int32 to read_csv if your data is integer, or dtype=np.float32 if you do not need full precision, to reduce memory usage and speed up processing.
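
For example, a minimal sketch (same hypothetical file name as above), assuming every column is numeric:

import numpy as np
import pandas as pd

# read 100 rows at a time, storing values as 32-bit floats to halve memory use
reader = pd.read_csv('bigfile.tsv', sep='\t', header=None,
                     chunksize=100, dtype=np.float32)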

NB: you can find examples of memory usage here: Using Chunksize in Pandas.

Michaël