What is a more efficient way to remove duplicates from a CSV file based on specific fields using a batch script, (and gawk, if needed)?

Question

I have two csv documents that contain lists of files from a source and destination in Google Drive generated by GAM. One is called copytoarchive.csv and lists all relevant files in the source. The other is alreadyinarchive.csv and lists all relevant files already in the destination.

The way Google Drive works is to assign a UID to each file regardless of its name. The csv file list shows the file UID in one column and the file name in another.

Here is an example of what copytoarchive.csv looks like:

Owner,id,name,Parent
user@domain.com,1gyKqu_P0h3j1Vn-6EwUv_99q,PreschoolExampleLessonName-20230504_050216-Meeting Recording.mp4,1b-U8XU0jYVFCggoEH9E9wqUm
user@domain.com,14-sg-qSnn5GDAuftANdLxDXp,OneonOneExampleLessonName-20230510_043228-Meeting Recording.mp4,1TtaABRvoki4gNuyqRrlyTfzj
user@domain.com,1L9mQBJ6d3DIPbiIEyV7akArV,OtherYearLevelExampleLessonName-20230510_033024-Meeting Recording.mp4,1CdiBgePlVqPvtcEp83DmcUrotr
user@domain.com,1oHaFzLF_KcgVX-hZn5etBka9,TeacherTrainingLesson-20230510_000950-Meeting Recording.mp4,1QusVD-a9U16I-0GTP1t-Vd9Ez
user@domain.com,1_ewCEh37sZYpqZlr3TC8u2Yl,ExampleStaffMeeting-20230509_045403-Meeting Recording.mp4,1tOb7xV5OCMMebn2ab2KdXGvc
user@domain.com,1SyXjINXttrb3VKvpbjpm1y-V,ExampleLessonName-20230503_052304-Meeting Recording.mp4,13g_fYh9HYtnDtd4psHEZi

Here is what alreadyinarchive.csv looks like:

Owner,id,name,Parent
user@domain.com,1Zlnhqf6fSxTRT2JEmQS91cCX,PreschoolExampleLessonName-20230504_050216-Meeting Recording.mp4,1CdiBgePlVqPvtcEp83DmcUro
user@domain.com,1Lg2W0w8YGJytSgJl2JblBly3,OtherYearLevelExampleLessonName-20230510_033024-Meeting Recording.mp4,1TtaABRvoki4gNuyqRrlyTfzj
user@domain.com,1Q_K0D1RgZlz-LMlDUVrV0gGi,ParentTrainingLesson-20230510_000950-Meeting Recording.mp4,1b-U8XU0jYVFCggoEH9E9wq
user@domain.com,1LIrRoTGtADjQRg9IRmIlJ3oV,ExampleStaffMeeting-20230509_045403-Meeting Recording.mp4,1xVuHbE3pcWN1l7X109qTsIYZK
user@domain.com,1OHkH9Cg7i2-O-ZHXBr4wIYGZ,OneonOneExampleLessonName-20230510_043228-Meeting Recording.mp4,1U7Y2Xh4Qi3atCcVL262
user@domain.com,1jZsXB5TT0H0TRrvvZu5A3N1S,DifferentLessonName-20230503_052614-Meeting Recording.mp4,1eVS3QF_Sk_6fQkwF8PvTKQf

The data in the Owner and Parent fields is irrelevant for this part of the batch file.

How can I search the csv(s) for duplicates in just the filename field (field 3), then delete the whole record OR output to a new file only records that don't contain duplicates in field 3?

For example:

Notice that the filename from the first non header record of copytoarchive.csv matches the first non header record of alreadyinarchive.csv even though the UIDs are different. This would be marked as a duplicate.
Also notice that the filename in the second non header record of copytoarchive.csv matches the filename in the fifth non header record of alreadyinarchive.csv even though the UIDs are different. This would also be marked as a duplicate.
Further, notice that although the timestamp of TeacherTrainingLesson... (4th non header record) of copytoarchive.csv matches ParentTrainingLesson... (3rd non header record) of alreadyinarchive.csv, because the filenames don't fully match these should not be considered a duplicate.

Of approx. 2,000 records only around 300 are not duplicates.

I am happy to manipulate copytoarchive.csv and alreadyinarchive.csv into one file if needed.

At first I tried to achieve this with nested for /f loops, where the first for /f would read copytoarchive.csv one line at a time and compare the relevant token (tokens=3) against the relevant token of EVERY line of the alreadyinarchive.csv through a 2nd nested for /f loop.

As requested, here is the for /f loop that I tried:

setlocal enabledelayedexpansion

rem This code block takes the info from copytoarchive.csv and alreadyinarchive.csv, deletes any matching lines (e.g., files already in the archive), and generates filestocopy.csv which contains the old and new parent IDs for only those files needing to be copied to the archive.

set /a filenum=0
set /a totalfiles=0

for /f "delims=, tokens=2-4" %%k in (C:\path\copytoarchive.csv) do (
    set /a filenum+=1
    set /a totalfiles+=1
    call set fileID[!filenum!]=%%k
    call set filename[!filenum!]=%%l
    call :checkifexists
)

set oldfileID[1]=OldParent
set newparentID[1]=NewParent
if exist c:\path\filestocopy.csv del c:\path\filestocopy.csv
for /l %%q in (1,1,%totalfiles%) do (
    echo !newowner[%%q]!,!newparentID[%%q]!,!oldfileID[%%q]!,!newparentname[%%q]! >> c:\path\filestocopy.csv
)

exit /b

:checkifexists
    for /f "delims=, tokens=3" %%n in (C:\path\alreadyinarchive.csv) do (
        if not !filename[%filenum%]!==%%n (
            set fileparentID[%filenum%]=%%m
        )
    )
    goto :eof

While this does technically work, because each list is almost 2,000 lines long, this creates approximately 4,000,000 iterations which takes WAY TOO LONG (>10 mins on my i9-12900 PC) to complete. Please note that this is only one section of a larger batch file.

I have read up on findstr but I couldn't find a way to use it to only search one field.

I also have GNU CoreUtils installed, including gawk. Reading gawk's PDF manual, it appears that this is likely the best path to take, but I am still struggling to find the correct parameters to pass to gawk to get it to only search the "name" fields.

I found @perl's answer here which looks tantalizingly close to a solution, but don't understand enough to know how to translate it to my use case.

Any help is appreciated.

I note the question has been downvoted. I'm happy to reword, edit, or even delete the question if it is poorly written or wrongly asked. I'm still learning, so welcome any feedback. — Joshua Howard, Jun 01 '23 at 03:02
Please show the format of `-redacted-` and `-redacted file name-`. Are they fixed- or variable-length? Does the data contain non-alphameric characters? Is this fixed-column format? Does each line end `-Meeting Recording.mp4` ? Are you simply looking for values of `date_number` that are new in `copytoarchive.csv`? Is this `date_number` in a consistent column location? Can `date_number` appear within `-redacted.*-`? — Magoo, Jun 01 '23 at 03:07
Probable reason for downvote is absence of batch code you have tried. — Magoo, Jun 01 '23 at 03:08
@magoo Thanks for the reply. I will edit the question to clarify as requested. The files are recordings of online lessons. The filenames are variable in length and will contain limited non-alphanumeric characters, primarily parentheses, hyphens, and underscores. Each line does happen to end in `-meeting recording.mp4`. I am looking for files in `copytoarchive.csv` that are not in `alreadyinarchive.csv`. The `date_number` will always appear in the filename column but it alone is insufficient to determine a duplicate, as some lessons will have occured at the same time but will be different. — Joshua Howard, Jun 01 '23 at 03:14
Can you work in a *nix environment (WSL perhaps?). the `sort` command has the `-u` option (unique). Or you may find a `comm` ([in]Common) set of options and possible preprocessing that would solve this for you. Good luck. — shellter, Jun 01 '23 at 03:50
@Magoo I have now updated the question with requested edits. — Joshua Howard, Jun 01 '23 at 04:05
@shellter I do have GNU CoreUtils installed, so I will investigate if `sort` is included in that. Thanks for the suggestion. — Joshua Howard, Jun 01 '23 at 04:06
Please open your two CSV files in a plain text editor, and copy & paste that content into your question, to replace the tabular representations you have submitted. — Compo, Jun 01 '23 at 07:38
@compo I had simply copy/pasted the data from Excel into the SO editor and it appeared as tabular. I have edited the question to show as comma-delimited (which is what it is IRL) as suggested. — Joshua Howard, Jun 02 '23 at 02:33

score 2 · Answer 1 · answered Jun 01 '23 at 09:12

@ECHO OFF
SETLOCAL ENABLEDELAYEDEXPANSION 
rem The following settings for the source directory and filenames are names
rem that I use for testing and deliberately include names which include spaces to make sure
rem that the process works using such names. These will need to be changed to suit your situation.

SET "sourcedir=u:\your files"
SET "destdir=u:\your results"
:: The first file is copytoarchive.csv
SET "filename1=%sourcedir%\q76378302.txt"
:: The second file is alreadyinarchive.csv
SET "filename2=%sourcedir%\q76378302_2.txt"
SET "outfile=%destdir%\outfile.txt"

(
FOR /f "usebackqskip=1delims=" %%e IN ("%filename1%") DO (
 rem %%e has each line in turn
 FOR /f "tokens=2*delims= " %%b IN ("%%e") DO (
  rem %%c has column 3 to eol
  SET "line=%%c"
  FOR /f "delims=?" %%o IN ("!line:.mp4=?!") DO FIND "%%o.mp4" "%filename2%">nul&IF ERRORLEVEL 1 ECHO %%e?"%%o.mp4"
 )
)
)>"%outfile%"

TYPE "%outfile%"

GOTO :EOF

rem Always verify against a test directory before applying to real data.

Well, the comments within the code explain it for the most part.

The magic is the !line:.mp4=?! clause.

Replace .mp4 in the current value of line with ?.

Batch cannot substring metavariables line %%e, so it is neccessary to transfer %%c to a user-variable (line) to perform the substring. The current value of line required, so the substitution is performed in delayedexpansion mode, hence the !var! format Stephan's DELAYEDEXPANSION link

I used ? because it cannot appear in a filename.

It's not clear whether you want the actual output line or just the filename, so I've provided both separated by ?

Oh Magoo, you've done it again! ... (another amazing `.bat` file). Good luck to all! — shellter, Jun 01 '23 at 13:08
@Magoo After tweaking it to fit my specific use case, I tried this and it worked well, generating the desired output in approx. 30 seconds. Thank you for the help, and thanks for looking past my mistakes. Every time I post here I learn more about how to be better in asking questions. I'm still trying to wrap my head around exactly HOW your code achieves the outcome in such a short time. I will study further. — Joshua Howard, Jun 02 '23 at 07:36

Aacini · Accepted Answer · 2023-06-01T21:30:59.013

Mmm... First, a couple comments about your request:

From your question, it seems that UID field is also irrelevant and that the duplicates are based on filename column only, so your description about UID just make this confusing.
In your for /f "delims=, tokens=2-4" %%k in ( ... command you included the comma as delimiter, but there is not a single comma in your files! Moreover:
The filenames included a space! And
This line in your code: echo !newowner[%%q]!,!newparentID[%%q]!,!oldfileID[%%q]!,!newparentname[%%q]! indicated that the output fields are separated by a comma, and no spaces...

You have not described nor posted the real format of your files, although this was requested a couple times in the comments...

For all these reasons I assumed that your posted files have not the right format, and that the right files should be these ones:

copytoarchive.csv:

Owner,id,name,Parent
user@domain.com,1gyKqu_P0h3j1Vn-6EwUv_99q,PreschoolExampleLessonName-20230504_050216-Meeting Recording.mp4,1b-U8XU0jYVFCggoEH9E9wqUm
user@domain.com,14-sg-qSnn5GDAuftANdLxDXp,OneonOneExampleLessonName-20230510_043228-Meeting Recording.mp4,1TtaABRvoki4gNuyqRrlyTfzj
user@domain.com,1L9mQBJ6d3DIPbiIEyV7akArV,OtherYearLevelExampleLessonName-20230510_033024-Meeting Recording.mp4,1CdiBgePlVqPvtcEp83DmcUrotr
user@domain.com,1oHaFzLF_KcgVX-hZn5etBka9,TeacherTrainingLesson-20230510_000950-Meeting Recording.mp4,1QusVD-a9U16I-0GTP1t-Vd9Ez
user@domain.com,1_ewCEh37sZYpqZlr3TC8u2Yl,ExampleStaffMeeting-20230509_045403-Meeting Recording.mp4,1tOb7xV5OCMMebn2ab2KdXGvc
user@domain.com,1SyXjINXttrb3VKvpbjpm1y-V,ExampleLessonName-20230503_052304-Meeting Recording.mp4,13g_fYh9HYtnDtd4psHEZi

alreadyinarchive.csv:

Owner,id,name,Parent
user@domain.com,1Zlnhqf6fSxTRT2JEmQS91cCX,PreschoolExampleLessonName-20230504_050216-Meeting Recording.mp4,1CdiBgePlVqPvtcEp83DmcUro
user@domain.com,1Lg2W0w8YGJytSgJl2JblBly3,OtherYearLevelExampleLessonName-20230510_033024-Meeting Recording.mp4,1TtaABRvoki4gNuyqRrlyTfzj
user@domain.com,1Q_K0D1RgZlz-LMlDUVrV0gGi,ParentTrainingLesson-20230510_000950-Meeting Recording.mp4,1b-U8XU0jYVFCggoEH9E9wq
user@domain.com,1LIrRoTGtADjQRg9IRmIlJ3oV,ExampleStaffMeeting-20230509_045403-Meeting Recording.mp4,1xVuHbE3pcWN1l7X109qTsIYZK
user@domain.com,1OHkH9Cg7i2-O-ZHXBr4wIYGZ,OneonOneExampleLessonName-20230510_043228-Meeting Recording.mp4,1U7Y2Xh4Qi3atCcVL262
user@domain.com,1jZsXB5TT0H0TRrvvZu5A3N1S,DifferentLessonName-20230503_052614-Meeting Recording.mp4,1eVS3QF_Sk_6fQkwF8PvTKQf

Ok. A process like this should avoid the use of any external (.exe) command (like findstror find) in order to run faster. Environment variables are enough to solve this problem:

@echo off
setlocal EnableDelayedExpansion

rem Load alreadyinarchive's name field in "already" array putting the value *in the subscript*
rem in order to quickly check for it via IF DEFINED command
rem changing spaces in the name for underscores
for /F "tokens=3 delims=," %%a in (AlreadyInArchive.csv) do (
   set "name=%%a"
   set "name=!name: =_!"
   set "already[!name!]=1"
)

rem Process copytoarchive file and output lines with no duplicates
(for /F "tokens=1-4 delims=," %%a in (CopyToArchive.csv) do (
   set "name=%%c"
   set "name=!name: =_!"
   if not defined already[!name!] echo %%a,%%b,%%c,%%d
)) > filestocopy.csv

filestocopy.csv:

user@domain.com,1oHaFzLF_KcgVX-hZn5etBka9,TeacherTrainingLesson-20230510_000950-Meeting Recording.mp4,1QusVD-a9U16I-0GTP1t-Vd9Ez
user@domain.com,1SyXjINXttrb3VKvpbjpm1y-V,ExampleLessonName-20230503_052304-Meeting Recording.mp4,13g_fYh9HYtnDtd4psHEZi

PS - You describe with detail irrelevant data, and don't describe enough the important points... :(

Thanks for your answer. Very clever use of delayed expansion. Regarding your comments: 1. Yes, UID is irrelevant to determining duplicates, but it is important that the UIDs remain matched with the filename for commands executed later in the .bat file. 2. I copy/pasted the csv from Excel rather than a text editor which changed it from comma to tab delimited. I have corrected OP. 3. Yes, the filenames contain spaces. 4. The output indeed needs to be comma delimited. 5. As soon as I read comments asking for edits, I make them. I will keep trying hard to provide only important details. — Joshua Howard, Jun 02 '23 at 07:48
This method was even quicker at executing than @magoo's elegant solution. Thanks. — Joshua Howard, Jun 02 '23 at 08:05
Well, _any solution_ to this problem will _obviously_ preserve _all fields_ in the same records, so I don't understand where you got the idea that a solution could "change fields" (?). Your explanation that UID's are "more important" than other fields (?) have absolutely _no relation_ to this problem. Your 3 examples of "duplicate records", where you involved UIDs and timestamps, don't make sense. You should have just said: _"The duplicate records contain the same file names"_. Period. — Aacini, Jun 03 '23 at 05:27
Magoo's solution manage the particular case where filenames ends in `.mp4`. Also, it was written for fields separated by space, not commas... As I said in my answer, any solution that uses `find.exe` command will be slower than another one that don't use any `*.exe` file, just environment variables. — Aacini, Jun 03 '23 at 05:27
I'm glad it was obvious to you that all fields would be preserved. Thanks for passing the knowledge on to me, I will know now for next time. My first attempt finding a solution used GNU Coreutil's `cut` to isolate the filename fields, then `findstr` to identify unique values. This solution didn't preserve all fields (due to `cut`) which is why I specified in my question. I have learned a lot through this experience and appreciate those who generously share their knowledge. — Joshua Howard, Jun 04 '23 at 23:19

What is a more efficient way to remove duplicates from a CSV file based on specific fields using a batch script, (and gawk, if needed)?

2 Answers2