0

I am looking for a clean way to deidentify values in a csv file's column. The way I was working on was kind of a hack with counting values etc and wanted to see if a more realistic way to approach this. The input would be similar to:

One~Two~Three~Four
Test One~Failed for account 999999999, UserId: 7777777, Error: Duplicate nickname~8abc-964863c382d8~3/11/2021 1:03:43 PM
Test One~Failed for account 121212121, UserId: 3434343, Error: Duplicate firstname~zzzz-964863c382d8~3/11/2021 1:04:43 PM
Test One~Failed for account 565656565, UserId: 7878787, Error: Duplicate firstname~yyyy-964863c382d8~3/11/2021 1:05:43 PM

And need the output to be like this, replacing the userId and account number with x's:

One~Two~Three~Four
Test One~Failed for account XXXXXXXXX, UserId: XXXXXXX, Error: Duplicate nickname~8abc-964863c382d8~3/11/2021 1:03:43 PM
Test One~Failed for account XXXXXXXXX, UserId: XXXXXXX, Error: Duplicate firstname~zzzz-964863c382d8~3/11/2021 1:04:43 PM
Test One~Failed for account XXXXXXXXX, UserId: XXXXXXX, Error: Duplicate firstname~yyyy-964863c382d8~3/11/2021 1:05:43 PM

There would be other lines with data they may not have these types of lines, but I only want to identify these 2 types of values in the Two column. Any ideas would be appreciated!

  • If they are always the same number of digits preceded by the same word I think Replace using a regex would work for you. – Bee_Riii Mar 12 '21 at 15:23
  • Will they always be in that same context? Ie. account id is preceded by the text "account " and followed by a comma? – Mathias R. Jessen Mar 12 '21 at 15:24
  • Yes they would be exactly as above - Ie. account id is preceded by the text "account " and followed by a comma and then UserId: and number – Jack Black Mar 12 '21 at 15:27

1 Answers1

1

Misread the question at first..

You can import the csv and change only the identities in field Two like this:

$data = Import-Csv -Path 'D:\Test\TheFile.csv' -Delimiter '~'
foreach ($item in $data) {
    $item.Two =  $item.Two -replace '(account\s*)[^,]+', '$1XXXXXXXXX' -replace '(UserId:\s*)\d+', '$1XXXXXXX'
}
$data | Export-Csv -Path 'D:\Test\TheUpdatedFile.csv' -Delimiter '~' -NoTypeInformation

Export-Csv above will quote every field in the CSV If you have PowerShell 7, you can add -UseQuotes AsNeeded to the Export-Csv cmdlet to have the fields unquoted. For PowerShell versions below that, you can use my function ConvertTo-CsvNoQuotes I posted earlier.

Output (with quotes)

"One"~"Two"~"Three"~"Four"
"Test One"~"Failed for account XXXXXXXXX, UserId: XXXXXXX, Error: Duplicate nickname"~"8abc-964863c382d8"~"3/11/2021 1:03:43 PM"
"Test One"~"Failed for account XXXXXXXXX, UserId: XXXXXXX, Error: Duplicate firstname"~"zzzz-964863c382d8"~"3/11/2021 1:04:43 PM"
"Test One"~"Failed for account XXXXXXXXX, UserId: XXXXXXX, Error: Duplicate firstname"~"yyyy-964863c382d8"~"3/11/2021 1:05:43 PM"
Theo
  • 57,719
  • 8
  • 24
  • 41
  • This is excellent and just what I was looking for. Studying regular expressions this weekend. You guys make this look too easy! Thanks again, marking as solved! – Jack Black Mar 12 '21 at 16:17