I need to merge files with millions of lines side by side, each file becoming a separate column. I tried the code suggested here:

@echo off
set f1=1.txt
set f2=2.txt
set outfile=mix.txt
type nul>%outfile%
(
    for /f "delims=" %%a in (%f1%) do (
        setlocal enabledelayedexpansion
        set /p line=
        echo(%%a!line!>>%outfile%
        endlocal
    )
)<%f2%

pause

Concatenate 2 txt files line by line using Batch

But when I run it with non-ASCII (Greek) text, the output is garbled: the encoding of the result file changes from UTF-8 to Windows-1253 and the Greek characters are corrupted, even though all files involved, including the batch file, are UTF-8. Also, I get no separator (I want a tab).

Example input

file1

Agenda

file2

Διάταξη των εργασιών

Output

AgendaΔιάταξη των ΞµΟΞ³Ξ±ΟƒΞΉΟŽΞ½

Desired Output

Agenda[TAB]Διάταξη των εργασιών
greektranslator

1 Answer

You'll need to use code page 65001 (UTF-8):

@echo off
 CD /d "%~dp0"
 CHCP 65001 > nul
 Setlocal EnableExtensions DisableDelayedExpansion

 for /f "delims= " %%T in ('robocopy /L . . /njh /njs' )do set "TAB=%%T"

 set "f1=%~dp01.txt"
 set "f2=%~dp02.txt"
 set "outfile=%~dp0mix.txt"
 break>"%outfile%"

<"%f2%" (
 for /f "usebackq delims=" %%a in ("%f1%") do (
  set /p "right="
  set^ "left=%%a"
  setlocal enabledelayedexpansion
  >>"%outfile%" (echo(!left!!TAB!!right!)
  endlocal
))

type "%outfile%"
Pause

Output:

Agenda  Διάταξη των εργασιών

Edit

Batch / powershell solution to handle lines up to 8191 bytes

@echo off
 CD /d "%~dp0"
 CHCP 65001 > nul

 set "f1=%~dp01.txt"
 set "f2=%~dp02.txt"
 set "outfile=%~dp0mix.txt"
 break>"%outfile%"

 >"%Outfile%" (
   powershell -noprofile -command ^
   "$File1 = @(Get-Content "%f1%" -Encoding utf8); $File2 = @(Get-Content "%f2%" -Encoding utf8);$Max = [math]::Max($File1.GetUpperBound(0), $File2.GetUpperBound(0)); for($i = 0; $i -le $Max; $i++) {write-host ($File1[$i],$File2[$i]) -Separator "`t"}"
  )

 type "%outfile%" > Con
 Pause
goto:Eof

Note: the above does not check that both files have an equal number of lines, nor does it change what happens when one line is empty and the other is not. These considerations are not discussed in your question, so they are not addressed in this answer. If you expect an unequal number of lines, you should handle that case yourself whenever you need to prevent lines from being prefixed or suffixed with a lone separator character.
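If unequal line counts do matter, one possible way to handle them is sketched below. This is not part of the script above: `Join-Columns` is a hypothetical helper name, and the policy it applies (emit the tab only when both files have a line at that index) is an assumption about what you might want.

```powershell
# Hypothetical helper, not part of the original answer.
# Pairs two arrays of lines and writes the tab only when BOTH sides
# have a line at the current index (assumed policy).
function Join-Columns {
    param(
        [string[]]$Left,
        [string[]]$Right
    )
    $count = [math]::Max($Left.Count, $Right.Count)
    for ($i = 0; $i -lt $count; $i++) {
        $hasLeft  = $i -lt $Left.Count
        $hasRight = $i -lt $Right.Count
        if ($hasLeft -and $hasRight) { $Left[$i] + "`t" + $Right[$i] }
        elseif ($hasLeft)            { $Left[$i] }
        else                         { $Right[$i] }
    }
}
```

Called as `Join-Columns -Left $File1 -Right $File2`, it writes the merged lines to the pipeline. Be aware that dropping the tab for a missing left-hand line moves the right-hand text into the first column; if column alignment matters more than avoiding a stray separator, keep the tab instead.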

A breakdown:

  • Load the files into arrays, specifying utf8 encoding:
$File1 = @(Get-Content "%f1%" -Encoding utf8)
$File2 = @(Get-Content "%f2%" -Encoding utf8)
  • Get the maximum number of lines:
 $Max = [math]::Max($File1.GetUpperBound(0), $File2.GetUpperBound(0))
  • For each array index from 0 to Maximum
    • write lines at current index specifying separator:
 for($i = 0; $i -le $Max; $i++) {
  write-host ($File1[$i],$File2[$i]) -Separator "`t"
 }

Lastly, the same as a powershell .ps1 script:

 CHCP 65001 | Out-null

 $F1 = (resolve-path -path $PSScriptRoot\1.txt).Path
 $F2 = (resolve-path -path $PSScriptRoot\2.txt).Path
 $Outfile = "$PSScriptRoot\mix.txt"

 $File1 = @(Get-Content "$F1" -Encoding utf8)
 $File2 = @(Get-Content "$F2" -Encoding utf8)

 $Max = [math]::Max($File1.GetUpperBound(0), $File2.GetUpperBound(0))

 $(for($i = 0; $i -le $Max; $i++) {
  $line = ($File1[$i],$File2[$i]) -join "`t"
  Write-Output $Line
 }) | Out-File $Outfile -encoding utf8

 Type $Outfile
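
Because the question mentions files with millions of lines, loading both files into arrays may use a lot of memory. A minimal sketch of a streaming variant follows, reading one line from each file at a time with the .NET `StreamReader` and `StreamWriter` classes. The function name `Merge-Streamed` is hypothetical, and two choices are assumptions rather than part of the script above: a missing line is treated as an empty string, and the output is written as UTF-8 without a BOM.

```powershell
# Sketch only: Merge-Streamed is a hypothetical name.
# Pass full paths; .NET resolves relative paths against the process
# working directory, which may differ from the PowerShell location.
function Merge-Streamed {
    param(
        [string]$Path1,
        [string]$Path2,
        [string]$OutPath
    )
    $utf8 = New-Object System.Text.UTF8Encoding($false)   # UTF-8, no BOM (assumption)
    $r1 = $null; $r2 = $null; $w = $null
    try {
        $r1 = New-Object System.IO.StreamReader($Path1, $utf8)
        $r2 = New-Object System.IO.StreamReader($Path2, $utf8)
        $w  = New-Object System.IO.StreamWriter($OutPath, $false, $utf8)
        while ($true) {
            $a = $r1.ReadLine()   # $null once file 1 is exhausted
            $b = $r2.ReadLine()   # $null once file 2 is exhausted
            if ($null -eq $a -and $null -eq $b) { break }
            # A $null side expands to an empty string (assumed policy).
            $w.WriteLine("$a`t$b")
        }
    }
    finally {
        if ($w)  { $w.Dispose() }
        if ($r2) { $r2.Dispose() }
        if ($r1) { $r1.Dispose() }
    }
}
```

From a .ps1 script it could be called as `Merge-Streamed "$PSScriptRoot\1.txt" "$PSScriptRoot\2.txt" "$PSScriptRoot\mix.txt"`. Only one line per file is held in memory at a time, so line length and line count are limited by available memory rather than by `set /p`.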
T3RR0R
  • Interesting. When I open with Notepad/EmEditor, I still see the corrupt characters; when I open with Notepad++ I see normal text. Tab is not added in the output file and if the line is too long (over 1300 bytes approx.) it is cut off and aligned with a different line. Here are the files tested and output: https://www.translatum.gr/downloads/test-batch.zip For example line 251 in mix.txt has content from line 251 from 1.txt and line 247 (cut off) from 2.txt (Greek text). – greektranslator Sep 12 '21 at 18:39
  • Fixed the UTF-8 issue with suggestion from here https://superuser.com/a/685264/747811, issues of cutting off lines and missing tab remain. – greektranslator Sep 12 '21 at 19:02
  • Do _both_ files have lines that are too long? The `set /P` command can only read about 1021 characters, so you must read the file with the longest lines via `for /F` and the one with the shorter lines via `set /P` – Aacini Sep 12 '21 at 20:16
  • Well, these are just test files. Real world files will probably exceed 1021 characters in both source and target. Not sure what code should change to accommodate that. – greektranslator Sep 13 '21 at 05:42
  • Yes, my bad. I was not aware that there could be such issues. Anyway, just tried with cmd the "Edit" version: `'CHCP' is not recognized as an internal or external command, operable program or batch file. 'powershell' is not recognized as an internal or external command, operable program or batch file.` It did not produce an output file. – greektranslator Sep 13 '21 at 12:25