I have two CSV files, generated by GAM, that list files from a source and a destination in Google Drive. One is called copytoarchive.csv and lists all relevant files in the source. The other is alreadyinarchive.csv and lists all relevant files already in the destination.
Google Drive assigns a UID to each file regardless of its name. Each CSV file list shows the file UID in one column and the file name in another.
Here is an example of what copytoarchive.csv looks like:
Owner,id,name,Parent
user@domain.com,1gyKqu_P0h3j1Vn-6EwUv_99q,PreschoolExampleLessonName-20230504_050216-Meeting Recording.mp4,1b-U8XU0jYVFCggoEH9E9wqUm
user@domain.com,14-sg-qSnn5GDAuftANdLxDXp,OneonOneExampleLessonName-20230510_043228-Meeting Recording.mp4,1TtaABRvoki4gNuyqRrlyTfzj
user@domain.com,1L9mQBJ6d3DIPbiIEyV7akArV,OtherYearLevelExampleLessonName-20230510_033024-Meeting Recording.mp4,1CdiBgePlVqPvtcEp83DmcUrotr
user@domain.com,1oHaFzLF_KcgVX-hZn5etBka9,TeacherTrainingLesson-20230510_000950-Meeting Recording.mp4,1QusVD-a9U16I-0GTP1t-Vd9Ez
user@domain.com,1_ewCEh37sZYpqZlr3TC8u2Yl,ExampleStaffMeeting-20230509_045403-Meeting Recording.mp4,1tOb7xV5OCMMebn2ab2KdXGvc
user@domain.com,1SyXjINXttrb3VKvpbjpm1y-V,ExampleLessonName-20230503_052304-Meeting Recording.mp4,13g_fYh9HYtnDtd4psHEZi
Here is what alreadyinarchive.csv looks like:
Owner,id,name,Parent
user@domain.com,1Zlnhqf6fSxTRT2JEmQS91cCX,PreschoolExampleLessonName-20230504_050216-Meeting Recording.mp4,1CdiBgePlVqPvtcEp83DmcUro
user@domain.com,1Lg2W0w8YGJytSgJl2JblBly3,OtherYearLevelExampleLessonName-20230510_033024-Meeting Recording.mp4,1TtaABRvoki4gNuyqRrlyTfzj
user@domain.com,1Q_K0D1RgZlz-LMlDUVrV0gGi,ParentTrainingLesson-20230510_000950-Meeting Recording.mp4,1b-U8XU0jYVFCggoEH9E9wq
user@domain.com,1LIrRoTGtADjQRg9IRmIlJ3oV,ExampleStaffMeeting-20230509_045403-Meeting Recording.mp4,1xVuHbE3pcWN1l7X109qTsIYZK
user@domain.com,1OHkH9Cg7i2-O-ZHXBr4wIYGZ,OneonOneExampleLessonName-20230510_043228-Meeting Recording.mp4,1U7Y2Xh4Qi3atCcVL262
user@domain.com,1jZsXB5TT0H0TRrvvZu5A3N1S,DifferentLessonName-20230503_052614-Meeting Recording.mp4,1eVS3QF_Sk_6fQkwF8PvTKQf
The data in the Owner and Parent fields is irrelevant for this part of the batch file.
How can I search the CSV(s) for duplicates in just the filename field (field 3), then either delete the whole record or output to a new file only the records whose field 3 is not duplicated?
For example:
Notice that the filename in the first non-header record of copytoarchive.csv matches the filename in the first non-header record of alreadyinarchive.csv even though the UIDs are different. This would be marked as a duplicate.
Also notice that the filename in the second non-header record of copytoarchive.csv matches the filename in the fifth non-header record of alreadyinarchive.csv even though the UIDs are different. This would also be marked as a duplicate.
Further, notice that although the timestamp of TeacherTrainingLesson... (4th non-header record) of copytoarchive.csv matches that of ParentTrainingLesson... (3rd non-header record) of alreadyinarchive.csv, the filenames don't fully match, so these should not be considered duplicates.
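To make the matching rule above concrete, here is a minimal sketch in Python. Python is not one of the tools I'm using in the batch file; it is only a way to state the expected result. The function name records_not_in_archive and the shortened IDs (id1, p1, and so on) are made up for illustration, and the sketch assumes the header row is exactly Owner,id,name,Parent.

```python
import csv
import io


def records_not_in_archive(source_csv_text, archive_csv_text):
    """Return source rows whose 'name' field has no exact match in the archive."""
    # Collect every archived file name once; the UIDs are deliberately ignored.
    archive_rows = csv.DictReader(io.StringIO(archive_csv_text))
    archived_names = {row["name"] for row in archive_rows}

    # Keep only source rows whose full name is absent from the archive.
    source_rows = csv.DictReader(io.StringIO(source_csv_text))
    return [row for row in source_rows if row["name"] not in archived_names]


# Trimmed-down rows modelled on the samples above (IDs are hypothetical).
source = """Owner,id,name,Parent
user@domain.com,id1,PreschoolExampleLessonName-20230504_050216-Meeting Recording.mp4,p1
user@domain.com,id2,TeacherTrainingLesson-20230510_000950-Meeting Recording.mp4,p2
"""
archive = """Owner,id,name,Parent
user@domain.com,id9,PreschoolExampleLessonName-20230504_050216-Meeting Recording.mp4,p9
user@domain.com,id8,ParentTrainingLesson-20230510_000950-Meeting Recording.mp4,p8
"""

for row in records_not_in_archive(source, archive):
    print(row["id"], row["name"])
```

With these two sample rows, only the TeacherTrainingLesson record should survive: the Preschool name matches exactly, while the shared timestamp alone is not enough to count as a duplicate.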
Of approx. 2,000 records only around 300 are not duplicates.
I am happy to combine copytoarchive.csv and alreadyinarchive.csv into one file if needed.
At first I tried to achieve this with nested for /f loops, where the first for /f would read copytoarchive.csv one line at a time and compare the relevant token (tokens=3) against the relevant token of EVERY line of alreadyinarchive.csv through a second, nested for /f loop.
As requested, here is the for /f loop that I tried:
setlocal enabledelayedexpansion
rem This code block takes the info from copytoarchive.csv and alreadyinarchive.csv, deletes any matching lines (e.g., files already in the archive), and generates filestocopy.csv which contains the old and new parent IDs for only those files needing to be copied to the archive.
set /a filenum=0
set /a totalfiles=0
for /f "delims=, tokens=2-4" %%k in (C:\path\copytoarchive.csv) do (
    set /a filenum+=1
    set /a totalfiles+=1
    call set fileID[!filenum!]=%%k
    call set filename[!filenum!]=%%l
    call :checkifexists
)
set oldfileID[1]=OldParent
set newparentID[1]=NewParent
if exist c:\path\filestocopy.csv del c:\path\filestocopy.csv
for /l %%q in (1,1,%totalfiles%) do (
    echo !newowner[%%q]!,!newparentID[%%q]!,!oldfileID[%%q]!,!newparentname[%%q]! >> c:\path\filestocopy.csv
)
exit /b
:checkifexists
for /f "delims=, tokens=3" %%n in (C:\path\alreadyinarchive.csv) do (
    if not !filename[%filenum%]!==%%n (
        set fileparentID[%filenum%]=%%m
    )
)
goto :eof
While this does technically work, each list is almost 2,000 lines long, so it performs approximately 4,000,000 comparisons (2,000 x 2,000), which takes WAY TOO LONG (>10 mins on my i9-12900 PC) to complete. Please note that this is only one section of a larger batch file.
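To illustrate where that number comes from, here is a rough sketch that counts the work done by the all-pairs comparison versus indexing the archive names once and looking each source name up. It is written in Python purely for illustration (not part of my batch file), and the function names and the generated lesson-N.mp4 names are hypothetical stand-ins for the real lists.

```python
def count_nested_comparisons(source_names, archive_names):
    """Count comparisons when every source name is checked against every archive name."""
    comparisons = 0
    for _source_name in source_names:
        for _archive_name in archive_names:
            comparisons += 1  # one string comparison per pair, no early exit
    return comparisons


def count_set_lookups(source_names, archive_names):
    """Count steps when archive names are indexed once, then looked up per source name."""
    archived = set(archive_names)  # one pass over the archive list
    lookups = 0
    for source_name in source_names:
        _ = source_name in archived  # one hash lookup per source name
        lookups += 1
    return len(archive_names) + lookups


# Hypothetical names standing in for two lists of about 2,000 records each.
source_names = [f"lesson-{i}.mp4" for i in range(2000)]
archive_names = [f"lesson-{i}.mp4" for i in range(300, 2300)]

print(count_nested_comparisons(source_names, archive_names))  # 4000000
print(count_set_lookups(source_names, archive_names))         # 4000
```

The step counts are only a proxy for run time, but they suggest that any tool able to hold the archive names in a lookup table should need a few thousand operations rather than millions.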
I have read up on findstr but I couldn't find a way to use it to only search one field.
I also have GNU CoreUtils installed, including gawk. Reading gawk's PDF manual, it appears that this is likely the best path to take, but I am still struggling to find the correct parameters to pass to gawk to get it to only search the "name" fields.
I found @perl's answer here, which looks tantalizingly close to a solution, but I don't understand enough to know how to translate it to my use case.
Any help is appreciated.