
I am attempting to speed up loading data into ClickHouse, which is hosted in Docker on Windows 10 and driven from PowerShell. I'm wondering if I can leverage a parallel process to have 4 files loading at the same time. I'd like some help understanding whether this is possible and/or some pointers on how to approach it. Below is the current script I use to load data:

$files = Get-ChildItem "my_directory" | Sort-Object

foreach ($f in $files){
    Write-Host $f.FullName
    Get-Date | Write-Host
    Write-Host ("Start loading " + $f.FullName)
    # cat is an alias for Get-Content; the file contents are piped into clickhouse-client inside the container
    cat $f.FullName | docker run -i --rm --link ch:clickhouse-client yandex/clickhouse-client -m --host ch --query="INSERT INTO my_table FORMAT CSV"
    Get-Date | Write-Host
    Write-Host ("End loading " + $f.FullName)
    [GC]::Collect()
}
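As a side note, if the paired Get-Date calls are only there for timing, the built-in Measure-Command cmdlet can wrap each load instead. A minimal sketch, assuming the same directory and docker setup as above:

```powershell
# Measure-Command times the script block and returns a TimeSpan
foreach ($f in $files) {
    $elapsed = Measure-Command {
        cat $f.FullName | docker run -i --rm --link ch:clickhouse-client yandex/clickhouse-client -m --host ch --query="INSERT INTO my_table FORMAT CSV"
    }
    Write-Host ("Loaded " + $f.FullName + " in " + $elapsed.TotalSeconds + "s")
}
```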

I am loading the files one at a time and would like to load four at once. Based on this link:

Running tasks parallel in powershell

I've tried to put together the code, but I could use a little help to see if I am on the right track:

#I am assuming this is the code block of what to do
$block = {
    # variables from the caller's session are not visible inside a job,
    # so the file must come in through a parameter
    param($f)
    Write-Host $f.FullName
    Get-Date | Write-Host
    Write-Host ("Start loading " + $f.FullName)
    cat $f.FullName | docker run -i --rm --link ch:clickhouse-client yandex/clickhouse-client -m --host ch --query="INSERT INTO my_table FORMAT CSV"
    Get-Date | Write-Host
    Write-Host ("End loading " + $f.FullName)
    [GC]::Collect()
}

#my directory of files
$files = Get-ChildItem "my_directory" | Sort-Object

#Remove all jobs
Get-Job | Remove-Job
$MaxThreads = 4
#Start the jobs. Max 4 jobs running simultaneously.
foreach($f in $files){
    While ($(Get-Job -state running).count -ge $MaxThreads){
        Start-Sleep -Milliseconds 3
    }
    Start-Job -Scriptblock $Block -ArgumentList $f
}
#Wait for all jobs to finish.
While ($(Get-Job -State Running).count -gt 0){
    start-sleep 1
}
#Get information from each job.
$info = foreach($job in Get-Job){
    Receive-Job -Id $job.Id
}
#Remove all jobs created.
Get-Job | Remove-Job

New to PowerShell; I appreciate the help.

LoF10
  • How many rows each file have? – ramazan polat Aug 07 '19 at 17:21
  • approx 500k rows. I believe I was able to set up the jobs to run in parallel, placed some code below. Seems to be working so far. Monitoring my CPU and RAM I can see the computer spreading more power across as it reads and loads data. – LoF10 Aug 07 '19 at 21:19

1 Answer


I believe I solved this issue:

#create my block
$direc = "my_direc"
$block = {
    # $direc from the parent session is not visible inside the job,
    # so the full file path is passed in as a parameter
    param([string]$file)
    cat $file | docker run -i --rm --link ch:clickhouse-client yandex/clickhouse-client -m --host ch --query="INSERT INTO test FORMAT CSV"
    [GC]::Collect()
}

#my directory of files
$files = Get-ChildItem $direc | Sort-Object

#Remove all jobs
Get-Job | Remove-Job
$MaxThreads = 4
#Start the jobs. Max 4 jobs running simultaneously.
foreach($file in $files){
    While ($(Get-Job -State Running).count -ge $MaxThreads){
        Start-Sleep -Milliseconds 3
    }
    Start-Job -Scriptblock $block -ArgumentList $file.FullName
}
#Wait for all jobs to finish.
While ($(Get-Job -State Running).count -gt 0){
    start-sleep 1
}
#Get information from each job.
$info = foreach($job in Get-Job){
    Receive-Job -Id $job.Id
}
#Remove all jobs created.
Get-Job | Remove-Job
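If PowerShell 7+ is available, ForEach-Object -Parallel with -ThrottleLimit handles the throttling for you and avoids the manual job-counting loop. A hedged sketch, assuming the same directory and docker setup as above:

```powershell
# PowerShell 7+ only: -Parallel runs the block in runspaces, -ThrottleLimit caps concurrency at 4
Get-ChildItem "my_direc" | Sort-Object | ForEach-Object -Parallel {
    # $_ is the current file; variables from the parent scope would need the $using: prefix here
    cat $_.FullName | docker run -i --rm --link ch:clickhouse-client yandex/clickhouse-client -m --host ch --query="INSERT INTO test FORMAT CSV"
} -ThrottleLimit 4
```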
LoF10