
I use Colab and run cudf.read_csv() on a huge CSV file (3 GB, 17,540,000 records), but it fails.

import cudf
import numpy as np
import pandas as pd
import csv
g_df = cudf.read_csv('drive/MyDrive/m1.csv',escapechar="\\")

The error message is:

RuntimeError                              Traceback (most recent call last)
<ipython-input-9-efc4c69ac697> in <module>()
----> 1 g_df = cudf.read_csv('drive/MyDrive/m1.csv',escapechar="\\")
      2 g_df.shape

1 frames
/usr/local/lib/python3.7/site-packages/cudf/io/csv.py in read_csv(filepath_or_buffer, lineterminator, quotechar, quoting, doublequote, header, mangle_dupe_cols, usecols, sep, delimiter, delim_whitespace, skipinitialspace, names, dtype, skipfooter, skiprows, dayfirst, compression, thousands, decimal, true_values, false_values, nrows, byte_range, skip_blank_lines, parse_dates, comment, na_values, keep_default_na, na_filter, prefix, index_col, **kwargs)
    100         na_filter=na_filter,
    101         prefix=prefix,
--> 102         index_col=index_col,
    103     )
    104 

cudf/_lib/csv.pyx in cudf._lib.csv.read_csv()

RuntimeError: cuDF failure at: ../include/cudf/strings/detail/strings_column_factories.cuh:75: total size of strings is too large for cudf column
vicpython
  • A single cudf string column can only hold a total of max(int32) individual characters. Use dask-cudf: https://docs.rapids.ai/api/cudf/nightly/user_guide/10min.html – Nick Becker Sep 05 '21 at 16:28
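Back-of-the-envelope arithmetic shows why a 3 GB file trips this limit (assuming most of the bytes end up in a single string column, which is a rough simplification):

```python
INT32_MAX = 2**31 - 1            # ~2.147 billion: max total characters in one cudf string column
file_size_bytes = 3 * 1024**3    # ~3.22 billion bytes in a 3 GB file

# Even before parsing overhead, the raw text alone exceeds the
# per-column character budget of a single cudf string column.
print(file_size_bytes > INT32_MAX)  # True
```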

2 Answers


You can use pandas to read your CSV in chunks. I took this sample from How do I read a large csv file with pandas?

import pandas as pd

chunksize = 10 ** 6
with pd.read_csv(filename, chunksize=chunksize) as reader:
    for chunk in reader:
        process(chunk)  # replace with your per-chunk logic

I recommend reading it in chunks: loading the whole file at once requires a very large amount of memory, while processing it chunk by chunk keeps memory usage low.
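As a concrete, runnable illustration of the pattern above, the sketch below writes a small throwaway CSV (a stand-in for the 3 GB file; the file and column names are made up for the example) and sums a column chunk by chunk:

```python
import pandas as pd

# Build a small throwaway CSV as a stand-in for the large file.
pd.DataFrame({"value": range(10_000)}).to_csv("sample.csv", index=False)

chunksize = 1_000  # rows per chunk; tune this to your available memory
total = 0
with pd.read_csv("sample.csv", chunksize=chunksize) as reader:
    for chunk in reader:               # each chunk is an ordinary DataFrame
        total += chunk["value"].sum()  # aggregate without holding all rows

print(total)  # 49995000
```

Only one chunk of rows is resident in memory at a time, which is what makes this work for files that do not fit in RAM.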

ParisNakitaKejser

I faced the same issue with a file of similar size. It turned out I had forgotten to pass the delimiter parameter. That could be your case as well; try passing the delimiter parameter when you call read_csv.
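To show what a wrong delimiter does to parsing, here is a minimal sketch using pandas (cuDF's read_csv mirrors the pandas signature for this parameter; the file name is made up for the example):

```python
import pandas as pd

# A semicolon-delimited file read with the default comma delimiter
# collapses every row into a single mis-parsed column.
with open("semi.csv", "w") as f:
    f.write("a;b;c\n1;2;3\n")

wrong = pd.read_csv("semi.csv")                # delimiter omitted
right = pd.read_csv("semi.csv", delimiter=";")

print(wrong.shape)  # (1, 1) -- one giant string column
print(right.shape)  # (1, 3) -- three columns, as intended
```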

Jeremy Caney