4

I have a load of personal ID numbers, which I need to download automatically. Rather than store thousands of unique personal IDs on my machine, I want to multiply all the IDs by a random integer, save the result, and delete the original file.

This will end up in Excel, but VBA is not an option because of org policy. I can't install anything - limited to what comes with Windows. As far as I understand, Excel's power query isn't able to delete a source file after reading it, but I could be wrong - if so that could be another direction to go in.

Sample of csv (the ids are not real)

136983,96771,216453,211521,589046,513320,195746,404556,190103,312261
,,,,,,,,,
,,,,,,,,,
,,,,,,,,,
413213,122904,135502,282999,460212,343970,433526,530707,,
,,,,,,,,,
,,,,,,,,,
216327,121969,165339,538695,440435,175431,396049,360505,160665,375980
,,,,,,,,,
251625,327979,155758,231189,385458,342295,365599,100644,312974,385772
,,,,,,,,,
398438,114618,216349,356734,504029,256328,466559,534053,144303,211043
306499,515840,90297,434565,,,,,,

Features

  • It could be any number of rows, probably between a couple of hundred and ten thousand.
  • Many of the rows are blank. Blank rows need to be preserved, because another column will need to be imported to go next to id (a blank row indicates no people correspond to this value from the imported column)
  • Many of the rows have less than 10 id numbers - less than the maximum, in other words.

Desired output

Let's say that the random integer was 2, this would produce:

273966,193542,432906,423042,1178092,1026640,391492,809112,380206,624522
,,,,,,,,,
,,,,,,,,,
,,,,,,,,,
826426,245808,271004,565998,920424,687940,867052,1061414,,
,,,,,,,,,
,,,,,,,,,
432654,243938,330678,1077390,880870,350862,792098,721010,321330,751960
,,,,,,,,,
503250,655958,311516,462378,770916,684590,731198,201288,625948,771544
,,,,,,,,,
796876,229236,432698,713468,1008058,512656,933118,1068106,288606,422086
612998,1031680,180594,869130,,,,,,

What I've tried

This answer looks promising, but it only deals with one column. I couldn't figure how to adapt it so that it accepts any number of columns.

@echo off
setlocal enabledelayedexpansion
FOR /F "tokens=1-18* delims=," %%A IN (mycsv.csv) DO (
set sum1="%%~C"
set /a cole=!sum1! * 2
echo %%~A,%%~B,%%~C,%%~D,!cole!
) >> output.csv

I don't really understand this Powershell based answer.

$csv = Import-Csv mycsv.csv
foreach ($row in $csv) {
  [int]$row.B *= -1
  [int]$row.F *= -1
}
$csv | Export-Csv output.csv

If I try to use it, it returns

powershell : Exception setting "B": "The property 'B' cannot be found on this object. Verify that the property exists and can be 
At line:1 char:1
+ powershell -ep Bypass .\t.ps1
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (Exception setti...sts and can be :String) [], RemoteException
    + FullyQualifiedErrorId : NativeCommandError
 
set."
At C:\Users\user\OneDrive\Personal\t.ps1:3 char:3
+   [int]$row.B *= -1
+   ~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (:) [], SetValueInvocationException
    + FullyQualifiedErrorId : ExceptionWhenSetting
 
Exception setting "F": "The property 'F' cannot be found on this object. Verify that the property exists and can be 
set."
At C:\Users\user\OneDrive\Personal\t.ps1:4 char:3
+   [int]$row.F *= -1
+   ~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (:) [], SetValueInvocationException
    + FullyQualifiedErrorId : ExceptionWhenSetting
 
Exception setting "B": "The property 'B' cannot be found on this object. Verify that the property exists and can be 
set."
At C:\Users\user\OneDrive\Personal\t.ps1:3 char:3
+   [int]$row.B *= -1
+   ~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (:) [], SetValueInvocationException
    + FullyQualifiedErrorId : ExceptionWhenSetting
 
Exception setting "F": "The property 'F' cannot be found on this object. Verify that the property exists and can be 
set."
At C:\Users\user\OneDrive\Personal\t.ps1:4 char:3
+   [int]$row.F *= -1
+   ~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (:) [], SetValueInvocationException
    + FullyQualifiedErrorId : ExceptionWhenSetting

I assumed that B and F just referred to the rows in the csv, but if I understand the error message (I may not) they're names that have to be defined (how?) Also I need this to work across all rows which have values, and skip empty rows and empty columns within rows. So hardcoding the rows and columns to be multiplied isn't going to work for me here.

Groping around in the dark, replacing

  [int]$row.B *= -1
  [int]$row.F *= -1

with

  [int]$row *= -1

returns

powershell : Cannot convert the "@{12=18}" value of type "System.Management.Automation.PSCustomObject" to type "System.Int32".
At line:1 char:1
+ powershell -ep Bypass .\t.ps1
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (Cannot convert ..."System.Int32".:String) [], RemoteException
    + FullyQualifiedErrorId : NativeCommandError
 
At C:\Users\user\OneDrive\Personal\t.ps1:3 char:3
+   [int]$row *= -1
+   ~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidArgument: (:) [], RuntimeException
    + FullyQualifiedErrorId : ConvertToFinalInvalidCastException
 
Cannot convert the "@{12=}" value of type "System.Management.Automation.PSCustomObject" to type "System.Int32".
At C:\Users\user\OneDrive\Personal\t.ps1:3 char:3
+   [int]$row *= -1
+   ~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidArgument: (:) [], RuntimeException
    + FullyQualifiedErrorId : ConvertToFinalInvalidCastException
Ne Mo
  • 198
  • 4
  • 18
  • 1
    Does `mycsv.csv` have a header row? If not, does it always have the same number of columns? – Mathias R. Jessen Feb 01 '22 at 16:33
  • 1
    In order to be able to access a row's 2nd column value with `$row.B`, the CSV file must have a column _header_ named `B` - it looks like your CSV file is missing a header row altogether (or you're not showing it and its column names are different). If the headers are missing, you can supply them with `Import-Csv`'s `-Header` parameter. – mklement0 Feb 01 '22 at 16:35
  • 1
    if no header, you can manually iterate over it with Get-Content and looping through each line. You can break up the comma-separated values with split. – thepip3r Feb 01 '22 at 16:36
  • 1
    Thanks all. I'll try these approaches tomorrow and get back to you. There is no header row – Ne Mo Feb 01 '22 at 18:27
  • @NeMo i've added what approach you could take for a CSV without headers – Santiago Squarzon Feb 01 '22 at 18:54
  • 1
    what is the point of the following? >>> `Rather than store thousands of unique personal IDs on my machine, I want to multiply all the IDs by a random integer, save the result, and delete the original file.` <<< if it is just some sort of pseudo "security" ... i would not bother. secure the file, instead. the easiest is to load it into PoSh and then save it in something like a secure string OR encrypt it with a key file that is protected. – Lee_Dailey Feb 01 '22 at 20:35
  • @Lee_Dailey the main security measure is that only users who are supposed to see this information have read access to it. It's partly intended as an extra layer of security, yes. It's also a legal box ticking exercise - the id numbers can be directly linked to individuals' names if one were able to access the database. If someone got hold of the end product of the batch file after all this was done, at least they would not have numbers they could look up so not technically a breach of personal data. – Ne Mo Feb 01 '22 at 23:05
  • I'm doing all this without any educational background in scripting, so if there's more I can do from a security point of view, please do tell. – Ne Mo Feb 01 '22 at 23:06
  • @NeMo - that is ... _truly questionable_ "security". [*grin*] protect the data source with access permissions, not by obscuring the numbers. **_if nothing else, encrypt the data on your system using the built in stuff so that only your account on your system can decrypt it._** – Lee_Dailey Feb 02 '22 at 02:19
  • Naive question, but which built-in stuff are we talking about? Beyond restricting read access to those who need it? – Ne Mo Feb 02 '22 at 09:25
  • The end result needs to be read into an Excel Power Query. So something like Bitlocker isn't going to work - the IDs need to be encrypted individually. They are repeated throughout the CSV, so if 1234 becomes fgkb_+;_367, all occurrences of 1234 need to be fgkb_+;_367. – Ne Mo Feb 02 '22 at 09:41

5 Answers5

4

The logic for multiplying the values on each row of your CSV can be like this, assuming the CSV has a fixed number of columns:

function MultiplyRows {
[cmdletbinding()]
param(
    [parameter(Mandatory, ValueFromPipeline)]
    [object]$InputObject,
    [int]$MultiplyBy = 2
)

    begin { $isFirstObject = $true }
    process {
        if($isFirstObject) {
            $headers = $InputObject.PSObject.Properties.Name
            $isFirstObject = $false
        }

        $out = [ordered]@{}
        foreach($prop in $headers) {
            if(-not ($thisValue = [int]$InputObject.$prop)) {
                $out[$prop] = $null
                continue
            }
            $out[$prop] = $thisValue * $MultiplyBy
        }
        [pscustomobject]$out
    }
}

Import-Csv ... | MultiplyRows -MultiplyBy 3 | ConvertTo-Csv

If the CSV has no headers, you will need to determine how many columns it has and use the -Header parameter on Import-Csv. You can use the following, which, by looking at your CSV it should work but note that it may not work for all CSVs since some of then can have embedded commas like:

[pscustomobject]@{
    ExampleCol = 'value with , comma'
    AnotherCol = 'some value'
} | ConvertTo-Csv

# Results in:
"ExampleCol","AnotherCol"
"value with , comma","some value"

Using the function from before, and the CSV provided in question this is how the code would look:

$headers = (Get-Content ./some.csv -TotalCount 1).Split(',').ForEach({
    begin { $i = 0 }
    process { "Col{0}" -f $i++ }
})

# -MultiplyBy 2 is implied (Default Param value)
Import-Csv ./some.csv -Header $headers | MultiplyRows | ConvertTo-Csv

Which results in:

"Col0","Col1","Col2","Col3","Col4","Col5","Col6","Col7","Col8","Col9"
"6931098","5009052","5009132","6931104","5012678","6931102","4635438","5012902","6931094","6931100"
,,,,,,,,,
,,,,,,,,,
,,,,,,,,,
"6931092","6931088","5012886","6931084","6931090","5013392","6931086","5013550",,
,,,,,,,,,
,,,,,,,,,
"4713760","4713558","4713656","4713888","4713830","4713834","4713856","4713566","4713700","4713562"
,,,,,,,,,
"6883024","4934280","5143272","5143534","5143624","4934294","6883028","5143296","6883026","6883022"
,,,,,,,,,
"4761264","4761164","4761258","4761172","4761226","4761074","4761202","4761272","4761078","4760942"
"6442852","6442848","6442850","6442858",,,,,,
"5142022","4973532","3409286","5142074","5142080","5142036","5142086","5141880","4973484","4973540"
"5900466","5900428","5900536","6304092","6304100","6304094","6304098","6304102","6304096","5900768"
Santiago Squarzon
  • 41,465
  • 5
  • 14
  • 37
2

The for /F loop available in a combines multiple consecutive delimiters to one, so you cannot use it with the comma as delimiter to read a CSV file with empty cells/values/fields. However, you could use it to read full lines, together with a standard for loop to split each at commas, given that there are none of the characters ?, * and <, > (so for does not need to access the file system to derive matching file names and therefore maintains the given strings):

@echo off
setlocal EnableExtensions DisableDelayedExpansion

rem // Define constants here:
set "_FILE=%~dp0data.csv"         & rem // (full path to target file)
set "_SEPC=,"                     & rem // (separator character)
set "_HEAD="                      & rem // (set to something if header)
set /A "_MULT=%RANDOM%%%(1<<7)+1" & rem // (multiplicator; result < 2 Gi!)
set /A "_ZPAD=0"                  & rem // (optionally zero-pad to width)

setlocal EnableDelayedExpansion
if defined _HEAD (set "LINE=") else (set "LINE=#")
rem // Read file line by line:
for /F usebackq^ delims^=^ eol^= %%J in ("!_FILE!") do (
    if defined LINE (
        set "LINE=%%J" & set "COLL=%_SEPC%"
        rem // Iterate through separated items, even blank ones:
        for %%I in ("!LINE:%_SEPC%=" "!") do (
            rem // Retain blank items in case:
            if not "%%~I"=="" (
                rem set /A "ITEM=%%~I*_MULT"
                rem // Multiply item and correctly handle zero-padded numbers:
                set "ITEM=00000000%%~I" & set "ITEM=1!ITEM:~-9!"
                set /A "ITEM%%=1000000000, ITEM*=_MULT"
                rem // Optionally zero-pad resulting number:
                if %_ZPAD% gtr 0 (
                    if "!ITEM:~,-%_ZPAD%!"=="" (
                        set "ITEM=000000000!ITEM!" & set "ITEM=!ITEM:~-10!"
                        set "ITEM=!ITEM:~-%_ZPAD%!"
                    )
                )
                rem // Append new number to new line string:
                set "COLL=!COLL!%_SEPC%!ITEM!"
            ) else set "COLL=!COLL!%_SEPC%"
        )
        rem // Return current altered line string:
        echo(!COLL:~2!
    ) else set "LINE=#" & echo(%%J
)
endlocal

endlocal
exit /B

In the top section, set variable _MULT to a multiplication factor so that the resulting values are always less than 231 − 1, because otherwise, negative values may result due to overflow.

The variable _ZPAD defines the width which the resulting products are padded to by preceding them with enough zeros. Numbers with more digits are not truncated. The maximal width is 10 digits. No padding occurs when _ZPAD is set to zero or less.

The variable _HEAD must be set to something in case the CSV file contains a headline to be kept.

Note, that this script would fail when the CSV file contains (quoted) strings and/or quoted values that contain separator characters on their own.

The script, let us call it multiply.bat, returns the resulting lines in the console. To write them to a file, use redirection (assuming the CSV file is data.csv in the current directory):

multiply.bat "data.csv" > "data_NEW.csv"

To overwrite the original CSV file, simply execute the following command line afterwards:

move /Y "data_NEW.csv" "data.csv"
aschipfl
  • 33,626
  • 12
  • 54
  • 99
  • 1
    Your description is somewhat confusing... You say that "you cannot use `for /F loop` to read a CSV file with empty cells because it combines multiple consecutive delimiters to one. However, you could use a standard for loop... that therefore maintains the given strings". This explanation makes me think that you can _read a file_ with a standard FOR loop! – Aacini Feb 01 '22 at 19:19
  • You're right, @Aacini, thanks! I corrected the description and hope it's clearer now… – aschipfl Feb 01 '22 at 19:30
1

This is the way I would do it:

@echo off
setlocal EnableDelayedExpansion

rem Define the factor, use %random% here
set /A "factor=2"

(for /F "delims=" %%a in (input.txt) do (

   rem Get a line and eliminate multiple commas
   set "in=%%a"
   set "in=!in:,,=!"
   if "!in:~-1!" equ "," set "in=!in:~0,-1!"

   if not defined in (
      echo ,,,,,,,,,
   ) else (
      call :multiplyRow
      echo !out!
   )

)) > output.txt
goto :EOF


:multiplyRow

rem Multiply the numbers by the factor and count missing commas
set "out="
set /A "n=9,num=factor*%in:,=" & set "out=!out!!num!," & set /A "n-=1,num=factor*%" & set "out=!out!!num!"

rem Insert missing commas
for /L %%i in (1,1,%n%) do set "out=!out!,"

exit /B

The "magic line" in :MultiplyRow do several things: it multiply each comma-separated number in the %in% string by the factor and join each result to out string separated by a comma. It also decrement the commas counter for each number processed. You may appreciate this mechanism in a clearer way if you cancel the @echo off line (and the redirection to output.txt file) and carefully review the executed code.

Aacini
  • 65,180
  • 12
  • 72
  • 108
0
@ECHO OFF
SETLOCAL ENABLEDELAYEDEXPANSION
rem The following settings for the source directory, destination directory
rem filenames, output filename are names that I use for testing and deliberately include names
rem which include spaces to make sure that the process works using such names.
rem These will need to be changed to suit your situation.

SET "sourcedir=u:\your files"
SET "destdir=u:\your results"
SET "filename1=%sourcedir%\q70944027.txt"
SET "outfile=%destdir%\outfile.txt"

SET /a factor=2

SET "line="
(
FOR /f "usebackqdelims=" %%b IN ("%filename1%") DO (
 IF DEFINED line (
  SET "line=%%b"
  CALL :process
 ) ELSE (ECHO %%b&SET "line=x")
)
)>"%outfile%"

GOTO :EOF

:process
IF "%line:,=%" == "" ECHO %line%&GOTO :eof
SET "outline="
FOR %%c IN ("%line:,=" "%") DO IF "%%~c" neq "" (SET /a column=%factor% * %%~c&SET "outline=!outline!,!column!"
                                ) ELSE (SET "outline=!outline!,")
ECHO %outline:~1%
GOTO :eof

No mention is made of whether a single column may be missing in any line or whether the data may include leading zeroes.

Parenthesising the for statement allows otherwise-console output to be redirected to the output file.

The for reads each line from the source file. If the variable line is initialised to empty then the first line is simply regurgitated and line set non-empty, so a header line can be passed through. Setting line initially to some value will process every line (ie. no header line).

line is set to the text of the line read from the file, and processed by, of all things, a subroutine called process.

process first checks whether line contains only commas, by replacing each comma with nothing and checking against an empty string. If the test succeeds, the line of commas is regurgitated.

Otherwise, replace each comma in line with " " and quote the resultant string; process this with a simple for so each data-column will be quoted and each empty column will be represented by ""; all separated by spaces.

Rebuild each column by appending each term by its processed form and each empty element by nothing

Echo the result, except for the first character which will be a comma.

Magoo
  • 77,302
  • 8
  • 62
  • 84
0

This survived the Round Trip test. Source data was in TextIn.CSV, resulting multiplied data was stored in TextOut.CSV. And then TextOut.CSV was sent through the same process to divide by the same random number. The function ProcessFile takes a value such as "* 2" to multiply by 2 and a value such as "/ 2" to divide by 2. The second value passed to it is the name of the source file. I found the formula SET /A test=%RANDOM% * 100 / 32768 + 1 on this page: How to use random in BATCH script?

There are few things to consider:

  1. This code does NOT attempt to deal with a header row in a CSV file.
  2. If the source file has double quotes in it, I'm not sure of the results. The FOR command can be re-written to use the USEBACKQ parameter, but then that just changes it to apostrophes causing problems.
  3. It's not exactly the fastest running code. Will be interesting to see what it does with a CSV that has 5000 rows.
  4. I'm trying to disable the EOL function in the FOR command by setting it to the same character as DELIMS. The SS64.com's FOR command docs say that DELIMS are processed before EOL, so in theory the comma is grabbed by DELIMS before EOL sees it. It works in the code I've been using, but I can't find where anyone is saying that you can disable EOL this way.
@ECHO OFF & SETLOCAL ENABLEDELAYEDEXPANSION & GOTO :Start

:ParseLine
    SET NewLine=
    SET Line=%~1
    SET Line=!Line:,,=,$#@,!
    SET Line=!Line:,,=,$#@,!
    IF "!Line:~0,1!" EQU "," SET Line=$#@!Line!
    IF "!Line:~-1!" EQU "," SET Line=!Line!$#@
    FOR /L %%N IN (1, 1, 10000) DO (
        FOR /F "EOL=, TOKENS=1* DELIMS=," %%B IN ("!Line!") DO (
            IF "%%B" EQU "$#@" (
                SET Token=
            ) ELSE (
                SET /A Token = %%B !MathOp!
            )
            SET NewLine=!NewLine!!Token!,
            SET Line=%%C
        )
        IF "!Line!" EQU "" (
            ECHO !NewLine:~0,-1!
            GOTO :EOF
        )
    )
GOTO :EOF

:ProcessFile
    SET MathOp=%~1
    FOR /F "EOL= TOKENS=* DELIMS=" %%L IN (%~2) DO (
        if "%%L" NEQ "" (
            CALL :ParseLine "%%L"
        ) ELSE ECHO;
    )
GOTO :EOF

:Start
SET /A RandomValue=%RANDOM% * 100 / 32768 + 1
ECHO RandomValue=%RandomValue%
CALL :ProcessFile "* %RandomValue%" TextIn.CSV > TextOut.CSV
CALL :ProcessFile "/ %RandomValue%" TextOut.CSV > RoundTrip.CSV
FC TextIn.CsV RoundTrip.CSV
Darin
  • 1,423
  • 1
  • 10
  • 12