1

Given a table in a text file that contains row based data, what method would you recommend for converting to a column based table? (e.g. CSV).

Input_data.txt:

Source =         X:\folder_abc
Destination =    Y:\Abc_folder
Total bytes =    208,731,021
MB per min =     256.5
Source =         X:\folder_def
Destination =    Y:\xyz_folder
Total bytes =    123,134,545
MB per min =     326
Source =         X:\folder_foo
Destination =    Y:\Baz_folder
Total bytes =    24,344
MB per min =     532
...etc.

Desired Result (only formatted with tabs here for legibility):

Source,             Destination,        Total bytes,    MB per min
"X:\folder_abc",    "Y:\Abc_folder",    "208,731,021",  "256.5"
"X:\folder_def",    "Y:\xyz_folder",    "123,134,545",  "326"
"X:\folder_foo",    "Y:\Baz_folder",    "24,344",       "532"
...

Tools at my disposal are Windows batch files and Powershell. Prefer a .bat solution because I'm more comfortable there, but if that is too circuitous or opaque we can punt it.

UPDATE, as per comments

I've figured out how to turn the records into name & value variables, but don't know how to manipulate them from that point to transpose into columns.

for /f "tokens=1,2 delims==" %%a in ('findstr /c:"=" "%logfile%"') do (
  @echo %%a %%b
  )

It just occured to me I can do one column per text file and then append them all in Excel. Crude but workable perhaps(?)

for /f "tokens=1,2 delims==" %%a in ('findstr /c:"=" "%logfile%"') do (
  @echo %%b >>  %%a.csv
  )

UPDATE-2: quote all values in desired result, as dbenham pointed out not doing so will cause problems.

matt wilkie
  • 17,268
  • 24
  • 80
  • 115

4 Answers4

3

This problem may be easily solved with a pure Batch file that create several arrays, one per each column of the output file (field). When the input file is read, the index of the arrays is incremented each time the starting field appear ("Source" in this case), so subsequent elements are stored at the right positions in their respective arrays. The output just show one element from each array in the same line.

@echo off
setlocal EnableDelayedExpansion

set "header="
set "output="
set i=0
for /F "tokens=1* delims==" %%a in (Input_data.txt) do (
   set "field=%%a"
   set "field=!field:~0,-1!"
   if "!field!" equ "Source" set /A i+=1
   if !i! equ 1 (
      set "header=!header!,"!field!""
      set "output=!output!,"^^!!field![%%i]^^!""
   )
   for /F %%c in ("%%b") do set "!field![!i!]=%%c"
)

(
echo %header:~1%
for /L %%i in (1,1,%i%) do echo %output:~1%
) > Result.csv

Output example:

"Source","Destination","Total bytes","MB per min"
"X:\folder_abc","Y:\Abc_folder","208,731,021","256.5"
"X:\folder_def","Y:\xyz_folder","123,134,545","326"
"X:\folder_foo","Y:\Baz_folder","24,344","532"

You may review the array management in Batch files at: Arrays, linked lists and other data structures in cmd.exe (batch) script

EDIT: New method with no arrays added

After I read dbenham's comment I realized that the use of arrays in this problem is not necessary, so I modified my original solution accordingly; I also borrowed dbenham's trick of using %%~Na in order to eliminate spaces at end of the field name:

@echo off
setlocal EnableDelayedExpansion

set "header=1"
set "row="
(for /F "tokens=1* delims==" %%a in (Input_data.txt) do (
   if defined header set "header=!header!,"%%~Na""
   for /F "tokens=*" %%c in ("%%b") do set "row=!row!,"%%c""
   if "%%a" equ "MB per min " (
      if defined header echo !header:~2!& set "header="
      echo !row:~1!
      set "row="
   )
)) > Result.csv
Community
  • 1
  • 1
Aacini
  • 65,180
  • 12
  • 72
  • 108
  • I don't understand why the use of an array. It will slow things down if the file is large (creating large environment). `"tokens=*"` should be used when stripping leading spaces, just in case a value contains spaces. – dbenham Sep 04 '14 at 04:29
  • @dbenham: You are right! I modified my solution accordingly. – Aacini Sep 04 '14 at 14:43
2

I realize that you are not familiar with PowerShell, but it's probably something that you should look into. I was in your position about 3 years ago and now use it in place of batch files 90% of the time.

In PowerShell this is relatively simple. You can run the array of strings through a ForEach loop, creating an object and adding members to it for each property, then when you reach a new Source line output the previous object and start a new one. It automatically makes an array for you, and you can pipe that to Export-CSV.

What I will do specifically is setup a variable $Record as an empty string.

Then I get the content of the file, and pipe it to a Where statement that will match each line against a RegEx match. That will create the automatic variable $Matches, which gets passed down the pipe along with the line. The match will capture everything before the first colon, and then everything following the colon and any trailing whitespace.

This is piped to a ForEach loop, which will perform once for each line. It checks if $Matches[1] (everything before the first colon) = 'Source'. If it does, it outputs the current contents of $Record, and creates a new $Record as a custom object with one property: 'Source' = $Matches[2] (everything after the first colon and trailing whitespace). If $Matches[1] does not equal 'Source' then it adds a new property to $Record where the property name is $Matches[1], and the value is $Matches[2]. For cleanliness I have performed the .Trim() method on $Matches[2] just to make sure there's no leading or trailing spaces or line feeds or anything strange.

After I process everything I run it through a Where statement again to remove blank records (such as the first one I setup beforehand). Then I output $Record one more time. As you said you wanted this in a CSV I have piped the whole loop and the trailing $Record to Export-CSV

$Record = ""
$Output = @()
Get-Content Input_data.txt |     Where{$_ -match "([^:]*):\s*?(\S.*)"}|Foreach{
    if($Matches[1] -eq "Source"){
        $Output += $Record
        $Record = [PSCustomObject]@{'Source'=$Matches[2].trim()}
    }else{
        $Record | Add-Member $Matches[1] $Matches[2].trim()
    }
}|?{![string]::IsNullOrEmpty($_)} | Export-Csv Output.csv -NoTypeInformation
$Output += $Record
$Output | Export-Csv Output.csv -NoTypeInformation -Append

The result is a csv file with these contents:

"Source","Destination","Total bytes","MB per min"
"X:\folder_abc","Y:\Abc_folder","208,731,021","256.5"
"X:\folder_def","Y:\xyz_folder","123,134,545","326"
"X:\folder_foo","Y:\Baz_folder","24,344","532"

Or if you don't pipe it to Export-CSV it simply displays it on screen:

Source                    Destination              Total bytes              MB per min              
------                    -----------              -----------              ----------              
X:\folder_abc             Y:\Abc_folder            208,731,021              256.5                   
X:\folder_def             Y:\xyz_folder            123,134,545              326                     
X:\folder_foo             Y:\Baz_folder            24,344                   532

Edit: Ok, you get errors with Add-Member the way I'm using it. That means you have an older version of PowerShell. There's 2 solutions to that. The first, and my recommendation, update PowerShell. Sometimes that isn't an option though, so that's fine, we can work with that.

The way I'm using Add-Member doesn't work if you are on PS v1 or v2. How I'm using it is that if you pipe an object to Add-Member and then specify 2 string arguments it assumes the first is a NotePropertyName and the second is a NotePropertyValue. You can see what it looks like above. So what to do if that doesn't work is to use the more verbose syntax of:

Add-Member -InputObject $TargetVariable -MemberType NoteProperty -Name Name -Value Value

In our case it means we replace the Add-Member line as such:

Add-Member -InputObject $Record -MemberType NoteProperty -Name $Matches[1] -Value $Matches[2].trim()

And you went and changed the input. That's easy to fix... Change the RegEx match from "([^:]*):\s*?(\S.*)" to "([^=]*)=\s*?(\S.*)". So putting it all together:

$Record = ""
$Output = @()
Get-Content Input_data.txt | Where{$_ -match "([^=]*)=\s*?(\S.*)"}|Foreach{
    if($Matches[1] -eq "Source"){
        If(![String]::IsNullOrEmpty($Record)){$Output += $Record}
        $Record = [PSCustomObject]@{'Source'=$Matches[2].trim()}
    }else{
        Add-Member -InputObject $Record -MemberType NoteProperty -Name $Matches[1] -Value $Matches[2].trim()
    }
}
$Output += $Record
$Output | Export-Csv C:\Temp\Output.csv -NoTypeInformation

Edit2: I guess I had forgotten that -Append isn't an option for Export-Csv in older versions of PowerShell. This can be worked with by collecting all data and outputting it once at the end. I have updated the last script in my answer by creating an empty array $Output near the top, then in the loop instead of just outputting $Record whenever one is completed I add it to the array. I also modified that line to go through an If statement to avoid adding blank records to the array. Then after the ForEach loop I add the last record to the array, and finally output the entire array of records to a CSV file.

TheMadTechnician
  • 34,906
  • 3
  • 42
  • 56
  • I changed the input on you, sorry! The first `:` is now `=` to avoid problems with drive letters. However even without that I couldn't get your script to work on my machine. Input_data_colons.txt http://hastebin.com/raw/wosaligexe, script: http://hastebin.com/irahaqifuc.mel, error messages: http://hastebin.com/vehadodore.tex, and the error itself _"Add-Member : A positional parameter cannot be found that accepts argument 'Destination'."_ – matt wilkie Sep 04 '14 at 16:04
  • Ok, you have an older version of PowerShell, I hadn't accounted for that, but I have updated the answer and it should work for you now. – TheMadTechnician Sep 04 '14 at 16:27
  • I appreciate your efforts and encouragement to bring me into the light of PS and out of BAT dark ages. Unfortunately it sill doesn't work (_"A parameter cannot be found that matches parameter name 'Append'."_), and I need to push on with the original objective. A +5 for the clear explanation and not just a plop of "here this works" code. I will definitely revisit this thread when work spirals back in this direction, as it always does. – matt wilkie Sep 05 '14 at 20:07
  • Thanks for giving it a go, sorry about the compatibility issues that stopped the code from working. I did update the answer, to resolve the -append thing, and maybe you can look at that if/when you circle back. Good luck with your project, and please don't let my version incompatibility issues dissuade you from trying to pick up PowerShell. – TheMadTechnician Sep 05 '14 at 20:23
2

This is similar to Aacini's original answer, except I never store more than one line in memory. A large input file would consume a lot of memory, which would slow down the script. Storing only one line avoids that problem.

Another major difference is I let the code discover the column name that starts a new row, rather than hard coding the value.

I also use a different method to strip off the trailing space(s) from each column name in the header. I assume that the column names do not contain any of the following characters: :, ., \, or /. I rely on the fact that file names cannot end with a space, so the ~n modifier normalizes the "name" to remove any trailing space(s).

I also use "tokens=*" when stripping leading spaces from the values, just in case a value contains spaces.

@echo OFF
setlocal enableDelayedExpansion

set "input=test.txt"
set "output=result.csv"

set "row="
set "header="
set "begin="
set "first="
(
  for /f "usebackq tokens=1* delims==" %%A in ("%input%") do for /f "tokens=*" %%C in ("%%B") do (
    if "!begin!" equ "%%A" (
      if not defined first (
        set first=1
        echo !header:~1!
      )
      echo !row:~1!
      set "row="
    )
    set "row=!row!,"%%C""
    if not defined first for /f "delims=" %%H in ("%%A") do (
      if not defined begin set "begin=%%A"
      set "header=!header!,"%%~nH""
    )
  )
  echo !row:~1!
)>"%output%"


EDIT 2014-12-05

The same algorithm could be implemented more robustly in VBS or JScript, and it would be faster.

Or you could get a bit of a jump start and use JREPL.BAT - a hybrid JScript/batch utility that performs regular expression search and replace on text. It allows user defined JScript code snippets to be incorporated into the process, but is executed within a batch context.

The entire command could be put on one lonnnnnggggggg line, but that would be really ugly. Instead I use batch line continuation to define a variable with most of the user defined JScript code and pass that in using /JBEG. It is impossible to pass a double quote literal to CSCRIPT, so I use '\x22' instead.

The script expects the source file to be passed as the first and only argument, and the output is written to the same location using the same base name with a .csv extension.

@echo off
setlocal
set beg=^
var begin, header='.', line='', q='\x22';^
function writeLn(){^
  if (header) output.WriteLine(header.substr(2));^
  header='';^
  if (line) output.WriteLine(line.substr(1));^
  line='';^
}^
function repl($1,$2){^
  if ($1==begin) writeLn();^
  if (!begin) begin=$1;^
  if (header) header+=','+q+$1+q;^
  line+=','+q+$2+q;^
  return false;^
}
call jrepl "^(.+?) *= *(.*)" "repl($1,$2);" /jmatch /jbeg "%beg%" /jend "writeLn();" /f %1 /o "%~dpn1.csv"
exit /b

Below uses the exact same JScript code, but I use the /JLIB option to load it directly from the file instead of from a variable. The script uses standard hybrid Jscript/batch technology. This option allows me to use a double quote literal in the code.

@if (@X)==(@Y) @end /* harmless hybrid line that begins a JScript comment

::**** Batch code ********
@echo off
call jrepl "^(.+?) *= *(.*)" "repl($1,$2);" /jmatch /jlib "%~f0" /jend "writeLn();" /f %1 /o "%~dpn1.csv"
exit /b

****** Jscript code ******/

var begin, header='.', line='', q='"';

function writeLn(){
  if (header) output.WriteLine(header.substr(2));
  header='';
  if (line) output.WriteLine(line.substr(1));
  line='';
}

function repl($1,$2){
  if ($1==begin) writeLn();
  if (!begin) begin=$1;
  if (header) header+=','+q+$1+q;
  line+=','+q+$2+q;
  return false;
}
dbenham
  • 127,446
  • 28
  • 251
  • 390
  • I accepted this solution because (after study) I understand what it's doing, and it's the only .bat solution which doesn't require some advance knowledge of the column headings. The input_data.txt sample posted is simplified subset, and not all input files have all headings. This approach is the only one which handled all of them. – matt wilkie Sep 05 '14 at 20:59
  • for the interested, my annotated version of dbenham's script is at https://github.com/maphew/Speed-test/blob/master/stats/xxcopylog_to_stats.bat – matt wilkie Sep 08 '14 at 18:01
2

This works with the source data:

@echo off
(
 for /f "usebackq tokens=1,* delims==" %%a in ("input_data.txt") do (
   if not defined header echo Source,Destination,Total bytes,MB per min&set header=1
   for /f "tokens=*" %%c in ("%%b") do if "%%a"=="MB per min " (set/p=""%%c""<nul&echo() else (set/p=""%%c","<nul)
 )
)>"output_data.txt"

"output_data.txt"

Source,Destination,Total bytes,MB per min
"X:\folder_abc","Y:\Abc_folder","208,731,021","256.5"
"X:\folder_def","Y:\xyz_folder","123,134,545","326"
"X:\folder_foo","Y:\Baz_folder","24,344","532"
foxidrive
  • 40,353
  • 10
  • 53
  • 68
  • 2
    You may move the `echo Source,...` before the FOR loop and eliminate the `if not defined header` and `set header=1` – Aacini Sep 04 '14 at 14:47
  • @Aacini Yes, Aacini, that's true. – foxidrive Sep 04 '14 at 15:08
  • After some study I understand what's happening in the FOR loops. I'm lost in the `set/p` though. I understand the effect -- emit new line only on last field -- just not how it works. Could you elaborate perhaps? – matt wilkie Sep 05 '14 at 20:00
  • The `set/p` prints text without a trailing CRLF. My if command either prints `"text",` or `"text"` and an `echo(` which adds the CRLF to the end of the line. – foxidrive Sep 06 '14 at 04:57