I have a file in which the string before the first comma on each line is some kind of identifier. Here is a sample:

A, bla, bla...  
B, bla, bla...  
A, bla, bla...  
C, bla, bla...

I need to parse the file to collect all unique occurrences of this string. So, ideally, after processing I would have some kind of array [A, B, C]. The problem is that arrays are not officially supported in batch scripting. I know there are some workarounds, but the ones I checked out looked quite ugly.

What I have so far is something like this:

FOR /F "tokens=1 delims=, " %%i in (%FILE%) do (
    echo %%i
)

This produces the output:

A
B
A
C

How do I eliminate the duplicate occurrences of a string? What would be an elegant way to achieve this?

Please share your thoughts on how this problem could be solved.

jFrenetic
  • maybe this helps: http://stackoverflow.com/questions/11235153/how-to-find-if-a-string-is-in-a-list-of-strings-in-a-dos-batch-file – Vaelor Nov 27 '13 at 15:48

4 Answers

7
del "%temp%\u" 2>nul
FOR /F "tokens=1 delims=," %%i in (%FILE%) do ( find "%%i" "%temp%\u" >nul 2>&1 || <nul set/p=%%i,>> "%temp%\u")
type "%temp%\u"

What this does is take the file line by line, grab everything before the first comma, and pass it to the do section. The do section of the loop attempts to find the string in a file containing the unique strings collected so far. If find succeeds, it returns true and the second part (after the ||) is never evaluated. If it does not find the string, it writes the string followed by a comma to the file. The del at the top clears any list left over from a previous run, so old identifiers are not matched.
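The same idea can be spelled out over several lines with comments. This is only a sketch: the input name `data.txt` and the variable names `FILE` and `SEEN` are assumptions made for illustration, not part of the answer above.

```batch
@echo off
setlocal
rem Assumed names for illustration: FILE is the input, SEEN the scratch file.
set "FILE=data.txt"
set "SEEN=%temp%\u"

rem Remove leftovers from an earlier run so old identifiers are not matched.
if exist "%SEEN%" del "%SEEN%"

for /f "usebackq tokens=1 delims=," %%i in ("%FILE%") do (
    rem find sets errorlevel 0 if the identifier already occurs in the scratch file.
    find "%%i" "%SEEN%" >nul 2>&1
    rem Otherwise append it; set /p with input from nul writes no line break.
    if errorlevel 1 (
        <nul set /p "=%%i,">>"%SEEN%"
    )
)
type "%SEEN%"
```

With the sample input from the question, the scratch file should end up containing `A,B,C,`. Note that `find` matches substrings, so an identifier such as `A` would be treated as already seen once `AB` is in the list; the sketch keeps that behavior rather than fixing it.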

cure
  • This looks nice, could you explain this a little bit? Especially the part after the do ( ... ) – Vaelor Nov 27 '13 at 16:05
  • Hey, it works! Side note for the OP: `%TEMP%\u` could be replaced with the name of a file in the current directory, if you need to keep it. Your coding skills look great. Raise your communication skills to the same level and you will be a great programmer! – ixe013 Nov 27 '13 at 19:29
  • Thank you! That is very encouraging. – cure Nov 27 '13 at 20:23
  • Wow, this is indeed a very elegant solution. The only problem with it is that when I take a look at it a month later, I'd be wondering what the hell this line does :) – jFrenetic Nov 27 '13 at 21:17
  • @nephi12 - Most likely a user and/or question was deleted that was a source of some of your points. – dbenham Nov 28 '13 at 04:31
  • @nephi12 Don't worry, with answers like this, you'll gain your rep very quickly :) – jFrenetic Nov 28 '13 at 08:51
  • This method is very slow: it requires executing the 16KB `find.exe` file _with every input line_. – Aacini Nov 29 '13 at 00:33
3

Here is another way:

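@echo off
setlocal enabledelayedexpansion
set if=dedupe.txt
rem Seed Line with the first identifier of the sorted input.
for /f "tokens=1 delims=," %%A in ('Sort %if%') do if not defined Line set Line="%%A"
rem Feed every sorted identifier to :dedupe, then flush the last one.
for /f "tokens=1 delims=," %%A in ('Sort %if%') do call :dedupe "%%A"
Call :dedupe ""
for /l %%B in (1,1,%i%) do echo !Line[%%B]!
exit /b

:dedupe
rem Same identifier as the previous one: nothing to store yet.
if %1 EQU %Line% goto :eof
set /a i+=1
set Line[%i%]=%Line:~1,-1%
set Line=%1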
Matt Williamson
  • This is a more "sane" solution. I accepted the other answer because it was quite original. But thanks for your input, I appreciate it. – jFrenetic Nov 27 '13 at 21:22
3

I think the simplest way is precisely to use an array whose elements have unique subscripts, for example:

FOR /F "tokens=1 delims=, " %%i in (%FILE%) do (
   set id[%%i]=X
)

FOR /F "tokens=2 delims=[]" %%a in ('set id[') do echo %%a

I suggest you review this post; perhaps your opinion about arrays in Batch will change.

EDIT: Example added

For example, with this input:

A, bla, bla...  
B, bla, bla...  
A, bla, bla...  
C, bla, bla...

the code would execute:

set id[A]=X
set id[B]=X
set id[A]=X
set id[C]=X

Because there is no way to define two different elements with the same subscript, the defined elements at the end would be: id[A], id[B] and id[C].
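As a self-contained sketch of the above, the following script also clears any `id[...]` variables that might already exist and counts the unique identifiers. The input name `data.txt` and the `count` variable are assumptions added for illustration.

```batch
@echo off
setlocal
rem Assumed input file name, for illustration only.
set "FILE=data.txt"

rem Clear any id[...] variables inherited from the environment.
for /f "delims==" %%v in ('set id[ 2^>nul') do set "%%v="

rem Each identifier becomes a subscript; repeats overwrite the same element.
for /f "usebackq tokens=1 delims=, " %%i in ("%FILE%") do set "id[%%i]=X"

rem List the subscripts of the defined elements and count them.
set /a count=0
for /f "tokens=2 delims=[]" %%a in ('set id[') do (
    echo %%a
    set /a count+=1
)
echo %count% unique identifiers
```

For the sample input this should print A, B and C followed by `3 unique identifiers`. The identifiers come out in the order in which `set` lists the variables, which is not necessarily their order in the file, and because variable names are case-insensitive, `a` and `A` would count as the same identifier.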

Community
Aacini
  • I may be wrong, but I don't think this addresses the issue of unique values at all... – cure Nov 28 '13 at 16:18
  • Hey, I know this was forever ago, but I wanted to apologize for downvoting it. I didn't understand it back then and thought it was incorrect. It has been too long and I can't change my vote, so I thought I'd comment. Very cool solution to the problem. – cure Feb 09 '17 at 16:34
2
@echo off
    setlocal enableextensions enabledelayedexpansion
    set "seen="
    for /f "tokens=*" %%l in ('cmd /q /c "for /f delims^=^, %%a in (file) do echo %%a" ^| sort') do (
        if not "%%l"=="!seen!" (
            echo %%l
            set "seen=%%l"
        )
    )

This takes the file `file`, splits each line by `,`, echoes the first token, sorts the generated list, and iterates over that list. For each element, if it has not been seen, it is echoed and remembered as the new seen element. As the list is ordered, just remembering the last seen element is enough.
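If the nested `cmd` call is hard to read, the same sort-then-compare idea can be sketched with a scratch file between the two steps. This is a variant, not the code above: the names `data.txt`, `FILE` and `IDS` are assumptions for illustration.

```batch
@echo off
setlocal enableextensions enabledelayedexpansion
rem Assumed names for illustration: FILE is the input, IDS the scratch file.
set "FILE=data.txt"
set "IDS=%temp%\ids.tmp"

rem Step 1: write the first token of every line to the scratch file.
(for /f "usebackq tokens=1 delims=," %%a in ("%FILE%") do echo %%a) >"%IDS%"

rem Step 2: sort so equal identifiers become adjacent, then skip repeats.
set "seen="
for /f "usebackq delims=" %%l in (`sort "%IDS%"`) do (
    if not "%%l"=="!seen!" (
        echo %%l
        set "seen=%%l"
    )
)
del "%IDS%"
```

The scratch file costs an extra write, but it keeps each step on its own line and can be kept if the list of identifiers is needed again later. Because delayed expansion is enabled, identifiers containing `!` would not survive intact.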

MC ND
  • @nephi12: 1- Without knowing how many different values there are, and with a limit of 32767 characters in a variable, adding the data to an "array" can be problematic. A better option is to create a variable for each different value (which also solves your 4). 2- Yes, really. One process to tokenize, one to sort, one to find unique elements, with data flowing between them. Have you tried it with a file of 100 or more lines? 3- This just shows how to find unique elements. If reuse is needed, send the data to a temporary file. 4- If this were a requirement (it is not in the OP's question), then the response would be different. – MC ND Nov 27 '13 at 20:08
  • I don't care about the performance in this case, because this is just a small utility script with a tiny data file. So, this solution also works for me. Thanks for sharing, and for the explanation. – jFrenetic Nov 27 '13 at 21:28
  • @nephi12 I think down-voting is meant for answers that are obviously not going to work at all. Answering the question in a different manner is fair and reasonable, as people reading in the future can choose the method that suits them best. – foxidrive Nov 29 '13 at 02:36
  • @nephi12 I felt I should comment because Aacini's method received a downvote too. – foxidrive Nov 29 '13 at 05:12