I found a related answer here that is really helpful, but it's not quite what I'm looking for. I've also looked at a number of other questions, but unfortunately I can't figure out how to get this to work, even though it seems rather simple.

Basically, I'm using Import-Csv and manipulating a lot of data, but the header names can sometimes change. So instead of rewriting my code, I'd like to map the headers I'm given to the headers that are used in my code blocks. When outputting the final data as a CSV, I can either leave the 'updated' headers in place or, if I can figure out how to swap headers easily, swap them back to what they originally were.

So let's say I have a mapping file in Excel. I can do the mapping in rows or columns, whichever is easier. For this first example, I have the mapping in rows. When I use Import-Csv, I want to use the headers from Row #2 instead of the headers in Row #1. Here's the content of the mapping file:

[Mapping file image - Row #1 contains the original headers, Row #2 the desired ones: Identification → ID, Revenue Code → Revenue_Code, Total Amount for Line → Amount, Total Quantity for Line → Qty]

So basically if I hard coded this all, I'd have something like:

# Split the raw text into the header row and the remaining data rows
# (the leading empty token produced by the capture-group split is discarded via $null).
$null, $headerRow, $dataRows = (Get-Content -Raw foo.csv) -split '(^.+\r?\n)', 2

# Rename the headers one by one, then re-parse the data.
ConvertFrom-Csv ($headerRow.Trim() -replace 'Identification', 'ID' -replace 'Revenue Code', 'Revenue_Code' -replace 'Total Amount for Line', 'Amount' -replace 'Total Quantity for Line', 'Qty'), $dataRows

Except I don't want to hard-code it; I'm basically looking for a way to use -replace with a mapping file, or a hashtable if I can create one.

#Pseudo code for what I want (the Excel mapping would need to be exported
#to a CSV first, since Get-Content can't parse .xlsx files)
$hashtable = Get-Content mapping.xlsx
ConvertFrom-Csv ($headerRow.Trim() -replace $hashtable.Name, $hashtable.Value), $dataRows

I'm probably failing to find similar examples because I'm trying to be flexible about the format of the mapping file. My original idea was to treat the 1st row as a string and replace that entire string with the second row. But the hashtable idea came from restructuring the mapping to look like this instead:

[Mapping file image - two columns, Source and Target, with one row per header rename]

Here I would basically -replace each Source value with the corresponding Target value.
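In hashtable form, that mapping would look something like this (a sketch, assuming the Excel sheet is exported to MappingTable.csv - the file name used in the edit below - since Get-Content can't parse .xlsx):

#Build a Source -> Target lookup from the exported mapping file
$Map = @{}
(Import-Csv MappingTable.csv).ForEach({$Map[$_.Source] = $_.Target})
#$Map['Identification'] now yields 'ID', $Map['Revenue Code'] yields 'Revenue_Code', etc.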

EDIT If you need to convert back, give this a shot - but keep in mind it'll only work if you have a one-to-one relationship of Source:Target values.

#Changing BACK to the original Headers...
$Unmap = @{}
(Import-Csv MappingTable.csv).ForEach({$Unmap[$_.Target] = $_.Source})

#Get string data from CSV Objects
$stringdata = $outputFixed | ConvertTo-CSV -NoTypeInformation
$headerRow = $stringdata[0]
$dataRows = $stringdata[1..($stringdata.Count-1)] -join "`r`n"

#Create new header data
$unmappedHeaderRow = ($headerRow -replace '"' -split ',').ForEach({'"' + $Unmap[$_] + '"'}) -join ','

$newdata = ConvertFrom-Csv $unmappedHeaderRow, $dataRows
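From there the restored objects can be written back out, e.g. (the output file name is just a placeholder):

#Write the result back to disk with the original headers
$newdata | Export-Csv OriginalHeaders.csv -NoTypeInformation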
immobile2
  • If the column headers may change, might the order of the columns also change? – stackprotector Sep 13 '21 at 17:30
  • If the order and count stay the same I'd be inclined to simply use an array index to reference between the desired headers and the current ones. Very easy to do by simply peeling the property names off the first object returned from `Import-Csv` (a sketch of this idea follows after these comments). If I've got the scenario correct I can post an answer. – Steven Sep 13 '21 at 17:32
  • 1
    @Steven - I don't think it is safe to say that the order and count are going to be the same. The Count should, generally, be the same. The order could be different. – immobile2 Sep 13 '21 at 17:39
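A minimal sketch of that array-index idea, assuming the column order and count never change (the target names come from the question's example):

$desired = 'ID', 'Revenue_Code', 'Amount', 'Qty'    # new names, in column order
$rows    = Import-Csv foo.csv
$current = $rows[0].PSObject.Properties.Name        # peel the current names off the first object

$rows | ForEach-Object {
    $out = [ordered]@{}
    for ($i = 0; $i -lt $current.Count; $i++) {
        $out[$desired[$i]] = $_.($current[$i])      # copy the value under the new name
    }
    [pscustomobject]$out
}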

2 Answers


Here's a complete example that builds on your original attempt:

  • It provides the column-name (header) mapping via (another) .csv file, with columns Source and Target, where each row maps a source name to a target name, as (also) shown in your question.

  • The mapping CSV file is transformed into a hashtable that maps source names to target names.

  • The data CSV file is then read as plain text, as in your question - efficiently, but in full - split into header row and data rows, and a new header row with the mapped names is constructed with the help of the hashtable.

  • The new header row plus the data rows are then sent to ConvertFrom-Csv for to-object conversion based on the mapped column (property) names.

# Create sample column-name mapping file.
@'
Source,Target
Identification,Id
Revenue Code,Revenue_Code
'@ > mapping.csv

# Create a hashtable from the mapping CSV file
# that maps each Source column value to its Target value.
$map = @{}
(Import-Csv mapping.csv).ForEach({ $map[$_.Source] = $_.Target })

# Create sample input CSV file.
@'
Revenue Code,Identification
r1,i1
r2,i2
'@ > data.csv

# Read the data file as plain text, split into a header line and
# a multi-line string comprising all data lines.
$headerRow, $dataRows = (Get-Content -Raw data.csv) -split '\r?\n', 2

# Create the new header based on the column-name mapping.
$mappedHeaderRow =
  ($headerRow -replace '"' -split ',').ForEach({ $map[$_] }) -join ','

# Parse the data rows with the new header.
$mappedHeaderRow, $dataRows | ConvertFrom-Csv

The above outputs the following, showing that the columns were effectively mapped (renamed):

Revenue_Code Id
------------ --
r1           i1
r2           i2
mklement0
  • I hate to ask for speed-related advice, but you always seem to have pretty spot-on knowledge of the different methods. As my mapping tables grow in size, it seems that `(Import-Csv mapping.csv).ForEach({$map[$_.Source] = $_.Target})` is going to be faster. Do you know if there is any performance difference in using `$map[Source] = Target` or using `$map.Add(Source, Target)`? They both seem to perform the same with demo data and my time measurements, but curious if the only differentiator is the syntax. – immobile2 Sep 13 '21 at 20:23
  • @immobile2, yes, it looks like they perform about the same, but there is a behavioral difference: `[$key] = $value` quietly updates a _preexisting entry_ if one exists for `$key`, whereas `.Add($key, $value)` throws an exception (which surfaces as a statement-terminating error in PowerShell); a sketch illustrating this difference follows after these comments. However, the syntax to avoid for performance reasons is dot notation, which PowerShell also supports for hashtables: `.$key = $value` – mklement0 Sep 13 '21 at 21:26
  • As an aside: As I've just learned myself, it isn't the pipeline per se that slows a `ForEach-Object` solution down, it's the inefficient implementation of `ForEach-Object` itself, and the following workaround even outperforms the `.ForEach()` method: `Import-Csv mapping.csv | . { process { $map[$_.Source] = $_.Target } }` - see [this blog post](https://powershell.one/tricks/performance/pipeline) and the resultant feature request [GitHub issue #10982](https://github.com/PowerShell/PowerShell/issues/10982). – mklement0 Sep 13 '21 at 21:57
  • Another aside: I've looked at [`ConvertFrom-StringData`](https://learn.microsoft.com/powershell/module/microsoft.powershell.utility/convertfrom-stringdata), combined with a properties style input file (lines of `=` pairs), which directly constructs a hashtable, but, surprisingly, it performs worse than the `Import-Csv`-based solution. – mklement0 Sep 13 '21 at 21:59
  • Thanks for the feedback! I've looked at the blog post before, I think I actually may have linked it in a comment or another question you answered here on SO. My hashtables aren't going to reach a size where I'd sacrifice the readability of `.ForEach()` for a pipeline solution. If I was five or six pipelines in when I wanted a ForEach...that might change things! What is the period doing in your `process` script though? I thought it would be `Import-Csv mapping.csv | & { process { ... }}` dot notation is pretty foreign to me though, so I wouldn't have used `.$key` and your dot doesn't track – immobile2 Sep 14 '21 at 00:06
  • @immobile2, `. ` is the [dot-sourcing operator](https://docs.microsoft.com/en-us/powershell/module/microsoft.powershell.core/about/about_Operators#dot-sourcing-operator-), which runs the script block in the same scope as the caller, which is what `ForEach-Object` does. By contrast, `&`, the [call operator](https://docs.microsoft.com/en-us/powershell/module/microsoft.powershell.core/about/about_Operators#call-operator-) runs script blocks in a _child scope_. Using dot notation to access hashtable entries is common in PowerShell, as foreign as it may seem to users coming from other languages. – mklement0 Sep 14 '21 at 01:01
  • One change - although I can't figure out why I need to do this and you don't in your example...but it needs to be `$mappedHeaderRow = ($headerRow.Trim() -replace '"' -split ',').ForEach({ $map[$_] }) -join ','`. The `.Trim()` was needed in a [related answer](https://stackoverflow.com/a/68441577/15243610) that you had and there is a trailing newline that _sometimes_ gets added – immobile2 Sep 15 '21 at 02:06
  • @immobile2, the regex in this answer is subtly different: `^(.+)\r?\n` vs. `(^.+\r?\n)`, which ensures that the newline is _not_ captured (I've since also changed the regex in the linked answer). If you still need `.Trim()`, the implication is that the header line had trailing (intra-line) whitespace. – mklement0 Sep 15 '21 at 06:16
  • @immobile2, in fact, even the modified regex was needlessly complicated: `$headerRow, $dataRows = (Get-Content -Raw data.csv) -split '\r?\n', 2` will do, and I've updated both answers accordingly. – mklement0 Sep 15 '21 at 06:34
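A quick sketch of the indexer-vs-`.Add()` difference discussed in the comments above (keys and values are made up for illustration):

$demo = @{}

# Indexer syntax quietly creates or overwrites the entry.
$demo['Identification'] = 'ID'
$demo['Identification'] = 'Id'        # no error; the value is simply replaced

# .Add() throws if the key already exists.
$demo.Add('Revenue Code', 'Revenue_Code')
try {
    $demo.Add('Revenue Code', 'Rev')  # duplicate key -> exception
}
catch {
    "Caught: $($_.Exception.InnerException.Message)"
}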

The easiest thing to do here is to process the CSV and then transform each row, from whatever format it was in, into the new desired target format.

Pretend we have an input CSV like this.

RowID,MayBeNull,MightHaveAValue
1,,Value1
2,Value2,
3,,Value3

Then we import the csv like so:

#helper function for ugly logic
function HasValue($param){
    return -not [string]::IsNullOrEmpty($param)
}

$csv = Import-Csv C:\pathTo\this.csv

foreach ($row in $csv) {
    if (HasValue($row.MayBeNull)) {
        $newColumn = $row.MayBeNull
    }
    else {
        $newColumn = $row.MightHaveAValue
    }
    #generate new output
    [psCustomObject]@{
        Id        = $row.RowID
        NewColumn = $newColumn
    }
}

Which gives the following output:

Id NewColumn
-- ---------
1  Value1
2  Value2
3  Value3

This is an easy pattern to follow for a data migration script; from there you just need to scale it up to fix your problem.
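To avoid hardcoding `NewColumn` (the concern raised in the comment below), the same single-pass loop could be driven by the question's Source/Target hashtable instead; a rough sketch, with file names as placeholders:

# Build the Source -> Target lookup once.
$map = @{}
(Import-Csv MappingTable.csv).ForEach({ $map[$_.Source] = $_.Target })

# Single pass over the data: rename each row's properties per the mapping.
Import-Csv C:\pathTo\this.csv | ForEach-Object {
    $renamed = [ordered]@{}
    foreach ($prop in $_.PSObject.Properties) {
        # Keep the original name if no mapping exists for it.
        $name = if ($map.Contains($prop.Name)) { $map[$prop.Name] } else { $prop.Name }
        $renamed[$name] = $prop.Value
    }
    [pscustomobject]$renamed
}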

FoxDeploy
  • It seems like this might work, but I would need to put it into the entire script to see how much slower it makes everything. The reason I'm using `Import-CSV` and Powershell in the first place is because I'm cleaning up CSV data that is too large to open/manipulate in Excel. Preferably then, I only need to iterate through each row one time. Right now, I'm doing all the clean-up within a Foreach loop. With this method, I'd need to possibly iterate twice. Maybe I didn't phrase the question correctly, but I also don't see how I could avoid hardcoding `NewColumn` in your example – immobile2 Sep 13 '21 at 17:42