35

I have a very large .csv file (>500mb) and I wish to break this up into into smaller .csv files in command prompt. (Basically trying to find a linux "split" function in Windows".

This has to be a batch script as my machine only has windows installed and requesting softwares is a pain. I came across a number of sample codes (http://forums.techguy.org/software-development/1023949-split-100000-line-csv-into.html), however, it does not work when I execute the batch. All I get is one output file that is only 125kb when I requested it to parse every 20 000 lines.

Has anyone ever come across a similar problem and how did you resolve the issue?

SeekingAlpha
  • 7,489
  • 12
  • 35
  • 44
  • If you are using Windows, then it is not DOS. You are talking about the Windows Command Prompt (cmd.exe) – dbenham Dec 16 '13 at 03:17
  • Use split from the [GnuWin CoreUtils](http://gnuwin32.sourceforge.net/packages/coreutils.htm)? – seumas Dec 16 '13 at 13:14
  • Thought of that but again its a pain to obtain software at my place of work. This issue has been resolved thanks to all that contributed. – SeekingAlpha Dec 17 '13 at 00:12

6 Answers6

44

Try this out:

@echo off
setLocal EnableDelayedExpansion

set limit=20000
set file=export.csv
set lineCounter=1
set filenameCounter=1

set name=
set extension=
for %%a in (%file%) do (
    set "name=%%~na"
    set "extension=%%~xa"
)

for /f "tokens=*" %%a in (%file%) do (
    set splitFile=!name!-part!filenameCounter!!extension!
    if !lineCounter! gtr !limit! (
        set /a filenameCounter=!filenameCounter! + 1
        set lineCounter=1
        echo Created !splitFile!.
    )
    echo %%a>> !splitFile!

    set /a lineCounter=!lineCounter! + 1
)

As shown in the code above, it will split the original csv file into multiple csv file with a limit of 20 000 lines. All you have to do is to change the !file! and !limit! variable accordingly. Hope it helps.

Dale
  • 1,903
  • 1
  • 16
  • 24
  • 1
    On my PC (core i7), this takes about 30 seconds per megabyte. Give it 4 hours and it'll split your 500 MB csv. – indiv Dec 16 '13 at 04:22
  • Thanks Dale, Unfortunately this code gives the same output as the link I gave. Outputed a 126kb file with the same name and the contents were pretty much identical to what I had done previously. – SeekingAlpha Dec 16 '13 at 04:57
  • Hey Marco, may I know what is your expected output? – Dale Dec 16 '13 at 05:04
  • A file with approx 20 000 rows and 1000 columns – SeekingAlpha Dec 16 '13 at 05:21
  • I see. Simply change `set limit=200` to `set limit=20000` and it should help you to get your job done :) – Dale Dec 16 '13 at 05:28
  • 1
    You should be able to speed this up a little by on initialisation, setting `lineCounter=1+%limit%` and `filenameCounter=0`; then moving the `set splitfile` into the `if` compound, after incrementing `filenameCounter`. That way, the filename will be set only once per 20,000 lines, not for each and every line... – Magoo Dec 16 '13 at 05:56
  • 3
    Sadly, I believe this is doomed. There is a limit to the length of string (~8K) that an environment variable may contain and if you have 1,000 columns then you don't have to have much data in each column to exceed that - the column-separators would occupy 1,000 charcters for a start! Otherwise, Dale's scheme appears sound... – Magoo Dec 16 '13 at 06:01
  • turns out this doesn't work for 30Gb files...if anyone knows how can i do this with 30 or maybe even 100Gb files ? – Alireza Soori Aug 04 '15 at 09:41
  • 1
    Thanks for your script. By any chance it can be improved to handle CSV with UTF8 characters? – Mark Aug 12 '15 at 09:51
  • Working great, only issue is that its adding "ECHO is off." every X rows – JavaSheriff May 17 '16 at 18:17
  • @AlirezaSoori - I have posted [here](http://stackoverflow.com/a/36634433/2707864) a script that worked for me on files several Gb large. And **it works muuuch faster**. The downside is that it uses byte count instead of line count for splitting. – sancho.s ReinstateMonicaCellio May 18 '16 at 10:20
  • When I run this, it just opens a cmd window and promptly shuts it down? – Hairy Mar 12 '19 at 10:01
  • where do I need to keep the file, I was running the script in the folder where file is there, always I am getting an error `file not found` – Krish Nov 15 '19 at 15:47
41

A free windows app that does that

http://www.addictivetips.com/windows-tips/csv-splitter-for-windows/

Gonki
  • 567
  • 1
  • 8
  • 17
  • Nice program. Fast & easy. – Tom Collins May 25 '15 at 18:16
  • Phenomenal tool, did exactly what I needed, super fast and easy to use. – ammills01 Jun 01 '15 at 17:56
  • Great, no install needed – drtf Jul 25 '16 at 09:08
  • Beautiful. Fast. Clean. Works. Love it. – GWR Sep 01 '16 at 16:37
  • If I could give it 2 thumbs up, I would. Works well with the SQL files, you only need to rename them to .txt – dev_masta Sep 07 '16 at 00:08
  • all the comments made me think that this was too good to be true and actually a bunch of well-repped bots commenting but can confirm, works very well, very quick good solution. – John Mar 21 '18 at 10:08
  • This did the job for me, too. As others said, no installation required. Did a scan with Malwarebytes Antimalware and it's fine. Nice that it's an executable, can share with others who use supported version of Windows, don't have to remember what script to use to split CSVs, what I did with it, etc. – notacouch Mar 19 '20 at 00:22
26

Use the cgwin command SPLIT. Samples

To split a file every 500 lines counts:

split -l 500 [filename.ext]

by default, it adds xa,xb,xc... to filename after extension

To generate files with numbers and ending in correct extension, use following

split -l 1000 sourcefilename.ext destinationfilename -d --additional-suffix=.ext

the position of -d or -l does not matter,

  • "-d" is same as −−numeric−suffixes
  • "-l" is same as --lines

For more: split --help

computingfreak
  • 4,939
  • 1
  • 34
  • 51
hhh
  • 321
  • 4
  • 2
4

If splitting very large files, the solution I found is an adaptation from this, with PowerShell "embedded" in a batch file. This works fast, as opposed to many other things I tried (I wouldn't know about other options posted here).

The way to use mysplit.bat below is

mysplit.bat <mysize> 'myfile'

Note: The script was intended to use the first argument as the split size. It is currently hardcoded at 100Mb. It should not be difficult to fix this.

Note 2: The filname should be enclosed in single quotes. Other alternatives for quoting apparently do not work.

Note 3: It splits the file at given number of bytes, not at given number of lines. For me this was good enough. Some lines of code could be probably added to complete each chunk read, up to the next CR/LF. This will split in full lines (not with a constant number of them), with no sacrifice in processing time.

Script mysplit.bat:

@REM Using https://stackoverflow.com/questions/19335004/how-to-run-a-powershell-script-from-a-batch-file
@REM and https://stackoverflow.com/questions/1001776/how-can-i-split-a-text-file-using-powershell
@PowerShell  ^
    $upperBound = 100MB;  ^
    $rootName = %2;  ^
    $from = $rootName;  ^
    $fromFile = [io.file]::OpenRead($from);  ^
    $buff = new-object byte[] $upperBound;  ^
    $count = $idx = 0;  ^
    try {  ^
        do {  ^
            'Reading ' + $upperBound;  ^
            $count = $fromFile.Read($buff, 0, $buff.Length);  ^
            if ($count -gt 0) {  ^
                $to = '{0}.{1}' -f ($rootName, $idx);  ^
                $toFile = [io.file]::OpenWrite($to);  ^
                try {  ^
                    'Writing ' + $count + ' to ' + $to;  ^
                    $tofile.Write($buff, 0, $count);  ^
                } finally {  ^
                    $tofile.Close();  ^
                }  ^
            }  ^
            $idx ++;  ^
        } while ($count -gt 0);  ^
    }  ^
    finally {  ^
        $fromFile.Close();  ^
    }  ^
%End PowerShell%
Community
  • 1
  • 1
  • 1
    splitting a `csv` by a number of bytes might not be the best of all ideas. Lines shouldn't be splitted. – Stephan Apr 15 '16 at 06:10
  • @Stephan - I think your statement may or may not be true. It all depends on what one needs. I actually needed splitting a huge csv, and doing it at prescribed line numbers or prescribed byte numbers worked equally well. That was actually the need that led me to putting together this script. – sancho.s ReinstateMonicaCellio Apr 16 '16 at 04:12
  • of course it depends on your needs. Just wanted to note. A `csv` ending or starting in the middle of a value somewhere in the middle of a line seems like a nightmare to me. But f that doesn't matter, this seems to be a good answer (and the only one that does split at size instead of linenumbers). – Stephan Apr 16 '16 at 05:21
  • @Stephan - The tradeoff (if there is any tradeoff actually) speed vs. "accuracy" (sacrificing line count in favor of byte count) may be worth it. [This comment](http://stackoverflow.com/questions/20602869/batch-file-to-split-csv-file/36634433?noredirect=1#comment30829559_20603219) quotes splitting with the accepted answer a 500 MB csv file, in an i7 (unknown speed), in 4 hours. I had split with my script a 4.5Gb txt file, in an i5, in about 2 minutes. – sancho.s ReinstateMonicaCellio Apr 16 '16 at 12:32
0

This will give you lines 1 to 20000 in newfile1.csv
and lines 20001 to the end in file newfile2.csv

It overcomes the 8K character limit per line too.

This uses a helper batch file called findrepl.bat from - https://www.dropbox.com/s/rfdldmcb6vwi9xc/findrepl.bat

Place findrepl.bat in the same folder as the batch file or on the path.

It's more robust than a plain batch file, and quicker too.

findrepl /o:1:20000 <file.csv >newfile1.csv
findrepl /o:20001   <file.csv >newfile2.csv
foxidrive
  • 40,353
  • 10
  • 53
  • 68
0

I found this question while looking for a similar solution. I modified the answer that @Dale gave to suit my purposes. I wanted something that was a little more flexible and had some error trapping. Just thought I might put it here for anyone looking for the same thing.

@echo off
setLocal EnableDelayedExpansion
GOTO checkvars

:checkvars
    IF "%1"=="" GOTO syntaxerror
    IF NOT "%1"=="-f"  GOTO syntaxerror
    IF %2=="" GOTO syntaxerror
    IF NOT EXIST %2 GOTO nofile
    IF "%3"=="" GOTO syntaxerror
    IF NOT "%3"=="-n" GOTO syntaxerror
    IF "%4"==""  GOTO syntaxerror
    set param=%4
    echo %param%| findstr /xr "[1-9][0-9]* 0" >nul && (
        goto proceed
    ) || (
        echo %param% is NOT a valid number
        goto syntaxerror
    )

:proceed
    set limit=%4
    set file=%2
    set lineCounter=1+%limit%
    set filenameCounter=0

    set name=
    set extension=

    for %%a in (%file%) do (
        set "name=%%~na"
        set "extension=%%~xa"
    )

    for /f "usebackq tokens=*" %%a in (%file%) do (
        if !lineCounter! gtr !limit! (
            set splitFile=!name!_part!filenameCounter!!extension!
            set /a filenameCounter=!filenameCounter! + 1
            set lineCounter=1
            echo Created !splitFile!.
        )
        cls
        echo Adding Line !splitFile! - !lineCounter!
        echo %%a>> !splitFile!
        set /a lineCounter=!lineCounter! + 1
    )
    echo Done!
    goto end
:syntaxerror
    Echo Syntax: %0 -f Filename -n "Number Of Rows Per File"
    goto end
:nofile
    echo %2 does not exist
    goto end
:end
Community
  • 1
  • 1
SuperMykEl
  • 613
  • 3
  • 15