
So my CSV file is stored in the local Google Colab directory. It is about 3.31 GB in size. When I run the following line of code:

truthdata = pd.read_csv("out.csv",header=0)

the session runs out of memory and reconnects. Please let me know how I can read this large CSV file into a pandas DataFrame. Thanks!

Aditya Lahiri

3 Answers


The resources of Google Colab are limited to about 12 GB of RAM. Things you can do:

  • Use the usecols or nrows arguments of the pd.read_csv function to limit the number of columns and rows to read. That will decrease the memory usage (a minimal sketch follows this list).

  • Read the file in chunks and reduce the memory of each chunk using the following function. Afterwards, pd.concat the chunks (see the usage sketch after the function).
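For the first option, a minimal sketch, assuming the out.csv from the question and purely hypothetical column names col_a and col_b:

import pandas as pd

# Hypothetical columns -- replace with the ones you actually need.
wanted_cols = ["col_a", "col_b"]

# Read only those columns and only the first 100,000 rows.
truthdata_preview = pd.read_csv("out.csv", header=0, usecols=wanted_cols, nrows=100_000)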


The code is not mine; I copied it from the following link and then tweaked it: https://www.mikulskibartosz.name/how-to-reduce-memory-usage-in-pandas/

import gc

import numpy as np
from tqdm import tqdm


def check_if_integer(column):
    # Stand-in definition for the check_if_integer helper referenced below,
    # which is not shown in the original post: True if every non-null value
    # is a whole number, so the column can safely be cast to an integer dtype.
    non_null = column.dropna()
    return np.array_equal(non_null, non_null.astype(np.int64))


def reduce_mem_usage(df, int_cast=True, obj_to_category=False, subset=None):
    """
    Iterate through all the columns of a dataframe and modify the data type to reduce memory usage.
    :param df: dataframe to reduce (pd.DataFrame)
    :param int_cast: indicate if columns should be tried to be casted to int (bool)
    :param obj_to_category: convert non-datetime related objects to category dtype (bool)
    :param subset: subset of columns to analyse (list)
    :return: dataset with the column dtypes adjusted (pd.DataFrame)
    """
    start_mem = df.memory_usage().sum() / 1024 ** 2
    gc.collect()
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    cols = subset if subset is not None else df.columns.tolist()

    for col in tqdm(cols):
        col_type = df[col].dtype

        if col_type != object and col_type.name != 'category' and 'datetime' not in col_type.name:
            c_min = df[col].min()
            c_max = df[col].max()

            # test if column can be converted to an integer
            treat_as_int = str(col_type)[:3] == 'int'
            if int_cast and not treat_as_int:
                treat_as_int = check_if_integer(df[col])

            if treat_as_int:
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.uint8).min and c_max < np.iinfo(np.uint8).max:
                    df[col] = df[col].astype(np.uint8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.uint16).min and c_max < np.iinfo(np.uint16).max:
                    df[col] = df[col].astype(np.uint16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.uint32).min and c_max < np.iinfo(np.uint32).max:
                    df[col] = df[col].astype(np.uint32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
                elif c_min > np.iinfo(np.uint64).min and c_max < np.iinfo(np.uint64).max:
                    df[col] = df[col].astype(np.uint64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        elif 'datetime' not in col_type.name and obj_to_category:
            df[col] = df[col].astype('category')
    gc.collect()
    end_mem = df.memory_usage().sum() / 1024 ** 2
    print('Memory usage after optimization is: {:.3f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

    return df
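A minimal usage sketch of the second option, assuming the out.csv from the question and a chunk size of 1,000,000 rows (pick whatever fits comfortably in RAM):

import pandas as pd

chunks = []
# chunksize makes read_csv yield DataFrames of at most 1,000,000 rows each.
for chunk in pd.read_csv("out.csv", header=0, chunksize=1_000_000):
    chunks.append(reduce_mem_usage(chunk))  # shrink dtypes before keeping the chunk

truthdata = pd.concat(chunks, ignore_index=True)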
ivallesp

It depends on what exactly you want to do. In general, read_csv has a parameter called chunksize that allows you to iterate over chunks of the data. This is typically the approach for working with big files efficiently.
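A minimal sketch of that pattern, assuming the out.csv from the question:

import pandas as pd

# chunksize makes read_csv return an iterator of DataFrames instead of one big frame.
for chunk in pd.read_csv("out.csv", header=0, chunksize=100_000):
    # Work on one manageable piece at a time, e.g. chunk.apply(predict, axis=1)
    # for the row-wise predictions discussed in the comments below.
    print(chunk.shape)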

Alex Fish
  • The CSV file in question is test data. I have a trained classifier and I would like to make predictions for each row of the CSV file. – Aditya Lahiri Aug 17 '19 at 01:41
  • You can load chunks of as many rows as you want and then `.apply(predict, axis=1)` to run your predictor on each row. – Alex Fish Aug 19 '19 at 03:37

I was having the same problem; I can load a very large CSV file with these commands.

    from google.colab import files
    uploaded = files.upload()
    

    # I am using UFO data from NUFORC. It's really huge, but that's the name of my CSV file.

    import io
    import pandas as pd

    df2 = pd.read_csv(io.BytesIO(uploaded['nuforc_reports.csv']))

The only problem I found is juggling the dtypes that come with large data sets. I still haven't figured mine out yet and have a lot of issues with that right now, but those commands typically work in Google Colab on large data sets; a sketch of one way to pin the dtypes is below. Go get a cup of coffee, because it will process for a while.
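One way to tame the dtype juggling is to tell read_csv the column types up front instead of letting it guess them block by block. A minimal sketch, reusing the uploaded dict from above and purely hypothetical column names and dtypes for nuforc_reports.csv:

    import io
    import pandas as pd

    # Hypothetical columns and dtypes -- replace with the real schema of the file.
    dtypes = {
        "state": "category",
        "shape": "category",
        "duration": "float32",
    }

    df2 = pd.read_csv(
        io.BytesIO(uploaded['nuforc_reports.csv']),
        dtype=dtypes,
        parse_dates=["date_time"],  # hypothetical date column, parsed instead of guessed
    )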