write_csv_rows() very slow on large dataset with long strings

Question

I'm trying to export a dataframe that has 19 columns and about 150k rows to CSV using pandas to_csv(). One of the columns contains strings that are quite long in some cases (like 1000 characters or so). I'm facing extremely long export time. I have never reached the end but exporting first 1000 rows takes almost 200 seconds (and the resulting file is only 185 kilobytes!).

I'm working on quite powerful ec2 machine so hardware performance should not be an issue. The file is saving few kilobytes per second so I think I'm also not hitting the I/O limits. When I try to profile the export of first 1000 rows it turns out that pandas._libs.lib.write_csv_rows takes almost the entire execution time (profiling results attached). Is there any way I can quickly export such a frame to CSV in Python?

>>> profile.print_stats()
         709 function calls (707 primitive calls) in 196.859 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:103(release)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:143(__init__)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:147(__enter__)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:151(__exit__)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:157(_get_module_lock)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:176(cb)
       20    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:222(_verbose_message)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:369(__init__)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:58(__init__)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:707(find_spec)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:78(acquire)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:780(find_spec)
        5    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:843(__enter__)
        5    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:847(__exit__)
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:861(_find_spec_legacy)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:870(_find_spec)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:936(_find_and_load_unlocked)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:966(_find_and_load)
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:997(_handle_fromlist)
        5    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:1080(_path_importer_cache)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:1117(_get_spec)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:1149(find_spec)
        4    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:1233(find_spec)
        4    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:41(_relax_case)
       20    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:57(_path_join)
       20    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:59(<listcomp>)
        4    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:75(_path_stat)
        1    0.000    0.000  196.859  196.859 <stdin>:1(<module>)
        2    0.000    0.000    0.000    0.000 __init__.py:200(iteritems)
        1    0.000    0.000    0.000    0.000 _methods.py:37(_any)
        4    0.000    0.000    0.000    0.000 apipkg.py:133(__makeattr)
        1    0.000    0.000    0.000    0.000 base.py:1108(nlevels)
        2    0.000    0.000    0.000    0.000 base.py:1336(is_object)
        1    0.000    0.000    0.000    0.000 base.py:1408(_convert_slice_indexer)
        2    0.000    0.000    0.001    0.000 base.py:1984(to_native_types)
        2    0.000    0.000    0.001    0.000 base.py:2010(_format_native_types)
        3    0.000    0.000    0.000    0.000 base.py:3482(_validate_indexer)
        2    0.000    0.000    0.000    0.000 base.py:4155(_ensure_index)
        2    0.000    0.000    0.000    0.000 base.py:551(_reset_identity)
        2    0.000    0.000    0.000    0.000 base.py:557(__len__)
        2    0.000    0.000    0.000    0.000 base.py:563(__array__)
        2    0.000    0.000    0.000    0.000 base.py:588(values)
        1    0.000    0.000    0.000    0.000 codecs.py:185(__init__)
        1    0.000    0.000    0.000    0.000 common.py:102(_expand_user)
        3    0.000    0.000    0.000    0.000 common.py:1136(is_datetime_or_timedelta_dtype)
        2    0.000    0.000    0.000    0.000 common.py:128(_stringify_path)
        3    0.000    0.000    0.000    0.000 common.py:1371(needs_i8_conversion)
        2    0.000    0.000    0.000    0.000 common.py:1456(is_string_like_dtype)
        7    0.000    0.000    0.000    0.000 common.py:1722(_get_dtype)
        5    0.000    0.000    0.000    0.000 common.py:1773(_get_dtype_type)
        1    0.000    0.000    0.000    0.000 common.py:184(is_bool_indexer)
        1    0.000    0.000    0.001    0.001 common.py:291(_get_handle)
        3    0.000    0.000    0.000    0.000 common.py:334(is_datetime64tz_dtype)
        5    0.000    0.000    0.000    0.000 common.py:409(is_period_dtype)
        2    0.000    0.000    0.000    0.000 common.py:442(is_interval_dtype)
        1    0.000    0.000    0.000    0.000 common.py:464(_apply_if_callable)
        2    0.000    0.000    0.000    0.000 common.py:478(is_categorical_dtype)
        1    0.000    0.000    0.000    0.000 common.py:488(UnicodeWriter)
        5    0.000    0.000    0.000    0.000 common.py:511(is_string_dtype)
        2    0.000    0.000    0.000    0.000 common.py:85(is_object_dtype)
        5    0.000    0.000    0.000    0.000 dtypes.py:556(is_dtype)
        2    0.000    0.000    0.000    0.000 dtypes.py:678(is_dtype)
       12    0.000    0.000    0.000    0.000 dtypes.py:85(is_dtype)
        1    0.000    0.000    0.000    0.000 format.py:1526(__init__)
        4    0.000    0.000    0.000    0.000 format.py:1604(<genexpr>)
        1    0.000    0.000  196.858  196.858 format.py:1621(save)
        1    0.000    0.000    0.000    0.000 format.py:1658(_save_header)
        1    0.000    0.000  196.858  196.858 format.py:1738(_save)
        1    0.000    0.000  196.858  196.858 format.py:1756(_save_chunk)
        1    0.000    0.000  196.859  196.859 frame.py:1433(to_csv)
        1    0.000    0.000    0.000    0.000 frame.py:303(_constructor)
        1    0.000    0.000    0.000    0.000 frame.py:316(__init__)
        1    0.000    0.000    0.000    0.000 generic.py:120(__init__)
        1    0.000    0.000    0.000    0.000 generic.py:162(_init_mgr)
        1    0.000    0.000    0.000    0.000 generic.py:1804(_indexer)
        1    0.000    0.000    0.000    0.000 generic.py:1937(_slice)
        1    0.000    0.000    0.000    0.000 generic.py:1957(_set_is_copy)
        1    0.000    0.000    0.001    0.001 generic.py:3250(head)
        1    0.000    0.000    0.000    0.000 generic.py:346(_get_axis_number)
        1    0.000    0.000    0.000    0.000 generic.py:3583(__finalize__)
        1    0.000    0.000    0.000    0.000 generic.py:359(_get_axis_name)
        1    0.000    0.000    0.000    0.000 generic.py:3616(__setattr__)
        1    0.000    0.000    0.000    0.000 generic.py:372(_get_axis)
        1    0.000    0.000    0.000    0.000 generic.py:376(_get_block_manager_axis)
       19    0.000    0.000    0.000    0.000 generic.py:7(_check)
        1    0.000    0.000    0.001    0.001 indexing.py:1358(__getitem__)
        1    0.000    0.000    0.000    0.000 indexing.py:152(_slice)
        1    0.000    0.000    0.000    0.000 indexing.py:1658(_has_valid_type)
        1    0.000    0.000    0.001    0.001 indexing.py:1764(_get_slice_axis)
        1    0.000    0.000    0.001    0.001 indexing.py:1799(_getitem_axis)
        1    0.000    0.000    0.000    0.000 indexing.py:2148(need_slice)
        1    0.000    0.000    0.000    0.000 indexing.py:258(_convert_slice_indexer)
        3    0.000    0.000    0.000    0.000 internals.py:107(__init__)
        1    0.000    0.000    0.001    0.001 internals.py:1846(to_native_types)
       12    0.000    0.000    0.000    0.000 internals.py:189(mgr_locs)
        1    0.000    0.000    0.000    0.000 internals.py:2076(__init__)
        3    0.000    0.000    0.000    0.000 internals.py:218(make_block_same_class)
        3    0.000    0.000    0.000    0.000 internals.py:226(mgr_locs)
        3    0.000    0.000    0.000    0.000 internals.py:260(_slice)
        3    0.000    0.000    0.000    0.000 internals.py:279(getitem_block)
        3    0.000    0.000    0.000    0.000 internals.py:2921(make_block)
        3    0.000    0.000    0.000    0.000 internals.py:299(shape)
        1    0.000    0.000    0.000    0.000 internals.py:3017(__init__)
        1    0.000    0.000    0.000    0.000 internals.py:3018(<listcomp>)
        2    0.000    0.000    0.000    0.000 internals.py:3058(shape)
        6    0.000    0.000    0.000    0.000 internals.py:3060(<genexpr>)
        4    0.000    0.000    0.000    0.000 internals.py:3062(ndim)
        3    0.000    0.000    0.000    0.000 internals.py:307(dtype)
        3    0.000    0.000    0.000    0.000 internals.py:311(ftype)
        1    0.000    0.000    0.000    0.000 internals.py:3114(_rebuild_blknos_and_blklocs)
        1    0.000    0.000    0.000    0.000 internals.py:3524(is_consolidated)
        1    0.000    0.000    0.000    0.000 internals.py:3532(_consolidate_check)
        1    0.000    0.000    0.000    0.000 internals.py:3533(<listcomp>)
        1    0.000    0.000    0.000    0.000 internals.py:3612(get_slice)
        1    0.000    0.000    0.000    0.000 internals.py:3622(<listcomp>)
        1    0.000    0.000    0.000    0.000 internals.py:3829(_consolidate_inplace)
        2    0.000    0.000    0.006    0.003 internals.py:714(to_native_types)
        5    0.000    0.000    0.001    0.000 missing.py:123(_isna_ndarraylike)
        5    0.000    0.000    0.001    0.000 missing.py:26(isna)
        5    0.000    0.000    0.001    0.000 missing.py:51(_isna_new)
        1    0.000    0.000    0.000    0.000 numeric.py:424(asarray)
        3    0.000    0.000    0.000    0.000 numeric.py:621(require)
        6    0.000    0.000    0.000    0.000 numeric.py:692(<genexpr>)
        1    0.000    0.000    0.000    0.000 posixpath.py:230(expanduser)
        2    0.000    0.000    0.000    0.000 range.py:119(_simple_new)
        1    0.000    0.000    0.000    0.000 range.py:157(_data)
        6    0.000    0.000    0.000    0.000 range.py:224(dtype)
        5    0.000    0.000    0.000    0.000 range.py:469(__len__)
        2    0.000    0.000    0.000    0.000 range.py:479(__getitem__)
        2    0.000    0.000    0.000    0.000 range.py:56(__new__)
        2    0.000    0.000    0.000    0.000 six.py:184(find_module)
        2    0.000    0.000    0.000    0.000 {built-in method __new__ of type object at 0x5581996fad40}
        1    0.000    0.000    0.000    0.000 {built-in method _csv.writer}
        7    0.000    0.000    0.000    0.000 {built-in method _imp.acquire_lock}
        1    0.000    0.000    0.000    0.000 {built-in method _imp.is_builtin}
        1    0.000    0.000    0.000    0.000 {built-in method _imp.is_frozen}
        7    0.000    0.000    0.000    0.000 {built-in method _imp.release_lock}
        2    0.000    0.000    0.000    0.000 {built-in method _thread.allocate_lock}
        2    0.000    0.000    0.000    0.000 {built-in method _thread.get_ident}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.callable}
       27    0.000    0.000    0.000    0.000 {built-in method builtins.getattr}
       21    0.000    0.000    0.000    0.000 {built-in method builtins.hasattr}
      102    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
        6    0.000    0.000    0.000    0.000 {built-in method builtins.issubclass}
        2    0.000    0.000    0.000    0.000 {built-in method builtins.iter}
    26/24    0.000    0.000    0.000    0.000 {built-in method builtins.len}
        5    0.000    0.000    0.000    0.000 {built-in method builtins.max}
        2    0.000    0.000    0.000    0.000 {built-in method builtins.min}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.sum}
        1    0.000    0.000    0.000    0.000 {built-in method io.open}
        4    0.000    0.000    0.000    0.000 {built-in method numpy.core.multiarray.arange}
        6    0.000    0.000    0.000    0.000 {built-in method numpy.core.multiarray.array}
        4    0.000    0.000    0.000    0.000 {built-in method numpy.core.multiarray.empty}
        4    0.000    0.000    0.000    0.000 {built-in method pandas._libs.lib.is_integer}
        7    0.000    0.000    0.000    0.000 {built-in method pandas._libs.lib.isscalar}
        1    0.000    0.000    0.000    0.000 {built-in method posix.fspath}
        1    0.000    0.000    0.000    0.000 {built-in method posix.getcwd}
        4    0.000    0.000    0.000    0.000 {built-in method posix.stat}
        1    0.000    0.000    0.000    0.000 {method 'any' of 'numpy.ndarray' objects}
        1    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
        3    0.007    0.002    0.007    0.002 {method 'astype' of 'numpy.ndarray' objects}
        1    0.000    0.000    0.000    0.000 {method 'close' of '_io.TextIOWrapper' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        2    0.000    0.000    0.000    0.000 {method 'fill' of 'numpy.ndarray' objects}
        1    0.000    0.000    0.000    0.000 {method 'format' of 'str' objects}
        4    0.000    0.000    0.000    0.000 {method 'get' of 'dict' objects}
        3    0.000    0.000    0.000    0.000 {method 'items' of 'dict' objects}
       20    0.000    0.000    0.000    0.000 {method 'join' of 'str' objects}
        2    0.000    0.000    0.000    0.000 {method 'ravel' of 'numpy.ndarray' objects}
        1    0.000    0.000    0.000    0.000 {method 'reduce' of 'numpy.ufunc' objects}
        2    0.000    0.000    0.000    0.000 {method 'reshape' of 'numpy.ndarray' objects}
        5    0.000    0.000    0.000    0.000 {method 'rpartition' of 'str' objects}
       40    0.000    0.000    0.000    0.000 {method 'rstrip' of 'str' objects}
        1    0.000    0.000    0.000    0.000 {method 'startswith' of 'str' objects}
        3    0.000    0.000    0.000    0.000 {method 'upper' of 'str' objects}
        4    0.000    0.000    0.000    0.000 {method 'view' of 'numpy.ndarray' objects}
        1    0.000    0.000    0.000    0.000 {method 'writerow' of '_csv.writer' objects}
        2    0.000    0.000    0.000    0.000 {pandas._libs.lib.isnaobj}
        1  196.850  196.850  196.850  196.850 {pandas._libs.lib.write_csv_rows}

Did you check this [question](https://stackoverflow.com/questions/40660331/pandas-to-csv-slow-saving-large-dataframe) — Emmanuel-Lin, Jun 08 '18 at 13:13
Check this link. I use dask often... http://pythondata.com/dask-large-csv-python/ — Goran, Jun 08 '18 at 13:26
Have you looked at other search results to see if any fit your circumstances. A web search for `python pandas dataframe to_csv slow` provides a number of SO Q&A's like [What is the fastest way to output large DataFrame into a CSV file?](https://stackoverflow.com/questions/15417574/what-is-the-fastest-way-to-output-large-dataframe-into-a-csv-file). The [docs have some i/o performance meterics/code](http://pandas.pydata.org/pandas-docs/stable/io.html#performance-considerations) you might want to use to compare. The character of your data may also be a factor. — wwii, Jun 08 '18 at 13:29
If you want to rule out disk i/o factors and you have enough memory, you could pass an in-memory stream/buffer - [`io.StringIO`](https://docs.python.org/3/library/io.html#io.StringIO). `f = io.StringIO(); df.to_csv(f, ...)` — wwii, Jun 08 '18 at 13:44
This doesn't look in the realm of the other questions to me, it looks buggy based on the file size for 1000 rows but taking 200 seconds. Can you give me the character lengths of a couple of rows (i.e. for each column in a couple of rows) so I can try reproduce? What do you get if you try the regular `csv` module and `writerows(df.as_matrix)` for your 1000 rows? — roganjosh, Jun 08 '18 at 15:03
Also, for the 1000 rows, is that a separate df to the original or a subset of it e.g. through `iloc`? — roganjosh, Jun 08 '18 at 15:08
Do your strings contain repeats or are they mostly distinct? — jpp, Jun 08 '18 at 15:58

write_csv_rows() very slow on large dataset with long strings

0 Answers0