
I have large TSV files (>1 GB each) in a directory, and I need to split each file into an 80/20 split. With my limited knowledge of PowerShell I wrote the code below, but it is painfully slow. I know I can do this in milliseconds with Cygwin/bash, but I need to automate this process through batch files. I am sure there is a better and faster solution.

$DataSourceFolder = "D:\Data"

$files = Get-ChildItem $DataSourceFolder -Filter "*.tsv"

foreach ($file in $files)
    {
        $outputTrainfile = "$DataSourceFolder\partitions\" + $file.BaseName + "-train.tsv"
        $outputTestfile  = "$DataSourceFolder\partitions\" + $file.BaseName + "-test.tsv"

        $filepath = $file.FullName
        # Get the number of rows in the file
        $sourcelinecount = (Get-Content $filepath | Measure-Object).Count
        # Work out how many lines go to the head (train) and the tail (test)
        $headlinecount = [math]::Floor(($sourcelinecount * 80) / 100)
        $taillinecount = $sourcelinecount - $headlinecount

        # Create the output files
        New-Item -ItemType file $outputTrainfile -Force
        New-Item -ItemType file $outputTestfile -Force

        # Set content to the files
        Get-Content $filepath -TotalCount $headlinecount | Set-Content $outputTrainfile
        Get-Content $filepath -Tail $taillinecount | Set-Content $outputTestfile
    }
  • Possible duplicate of [How to process a file in PowerShell line-by-line as a stream](https://stackoverflow.com/questions/4192072/how-to-process-a-file-in-powershell-line-by-line-as-a-stream) – TessellatingHeckler Sep 22 '17 at 03:59
  • Thanks for pointing to the other question; it does provide another method, but I don't see that it's faster either. Also, the answers suggest not using PowerShell for this task; if not, then what are the alternatives? I tried using Python and scikit-learn to split into train/test, but as the files are huge that also fails with an out-of-memory error. – VJSharp Sep 22 '17 at 04:16
  • Might be of some help for you? https://github.com/dubasdey/File-Splitter – David Brabant Sep 22 '17 at 05:48
  • `Get-Content` is famously slow, and the .Net methods are much faster, it's the subject of a lot of questions and answers and blog posts and things. Although your code's approach of reading the file three times isn't ever going to be very fast. How accurate an 80/20 split do you need and how predictable is the size? Could you open the file, seek to 80% of the byte count, scan to a newline, read the remaining lines to a second file, and truncate the original? – TessellatingHeckler Sep 22 '17 at 05:55
  • Thanks. It doesn't have to be absolutely accurate; a few examples here and there don't really matter since I am going to use this for training a model. I tried optimizing it by reading only once and getting the content into a $variable, but I could not get -TotalCount and -Tail to work on that variable. Could you provide any sample code for the other suggestion? Appreciate your help! – VJSharp Sep 22 '17 at 17:23
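
Following the streaming suggestion in the comments above, here is a minimal sketch that reads each file with .NET readers and writers instead of three Get-Content passes. The paths are hypothetical placeholders, the 80/20 ratio follows the question, and it assumes the partitions folder already exists:

# Sketch only: single example file; wire into the question's foreach loop as needed.
$filepath        = "D:\Data\example.tsv"
$outputTrainfile = "D:\Data\partitions\example-train.tsv"
$outputTestfile  = "D:\Data\partitions\example-test.tsv"

# Pass 1: count lines without loading the whole file into memory.
$lineCount = 0
$reader = [System.IO.File]::OpenText($filepath)
while ($null -ne $reader.ReadLine()) { $lineCount++ }
$reader.Close()

$headLineCount = [math]::Floor($lineCount * 0.8)

# Pass 2: stream the first 80% of lines to the train file, the rest to the test file.
$reader      = [System.IO.File]::OpenText($filepath)
$trainWriter = New-Object System.IO.StreamWriter($outputTrainfile)
$testWriter  = New-Object System.IO.StreamWriter($outputTestfile)
try {
    $written = 0
    while ($null -ne ($line = $reader.ReadLine())) {
        if ($written -lt $headLineCount) { $trainWriter.WriteLine($line) }
        else { $testWriter.WriteLine($line) }
        $written++
    }
}
finally {
    $reader.Close()
    $trainWriter.Close()
    $testWriter.Close()
}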

1 Answer


Sorry to be late posting the answer; hopefully it will save effort for others.

I used bash.exe to split the files from PowerShell. Fast and furious.

Create a bash script and call it from PowerShell to split the files into the desired partitions.

Bash script (name it, for example, "Partition.sh"):

foldername=$1
filenamePrefix=$2
echo "$foldername"
echo "$filenamePrefix"
for filename in "$foldername/$filenamePrefix"*.tsv
do
 echo "Partitioning $filename"
 # Shuffle the rows so the 80/20 split is random
 shuf "$filename" > tmp
 lines=$(wc -l < tmp)
 echo "Read file successfully"
 # First 80% of the rows (truncated to an integer) go to the train file
 trainlines=$(echo "$lines*0.8/1" | bc)
 head -n "$trainlines" tmp > "$filename.train.tsv"
 # Everything after the train rows goes to the test file
 tail -n +"$((trainlines + 1))" tmp > "$filename.test.tsv"
 rm tmp
done

Call from PowerShell:

bash.exe /mnt/c/Partition.sh /mnt/c/trainingData/ "FilePrefix"
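
If several prefixes need partitioning, the call can be wrapped in a small PowerShell loop. The prefixes below are hypothetical, and WSL's bash.exe is assumed to be on the PATH:

# Hypothetical wrapper: run the partitioner once per file prefix.
$prefixes = @("FilePrefixA", "FilePrefixB")   # placeholder prefixes
foreach ($prefix in $prefixes) {
    bash.exe /mnt/c/Partition.sh /mnt/c/trainingData/ $prefix
}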