Unique fields analysis into formatted strings

Question

This is my first post. I'm starting a study of a collection of strings formatted in a URL-like fashion. Let's say that I have a file with strings like:

A/B/C/D

For me, this string has 4 components. The strings in the file have different lengths. I'm searching for an efficient way, maybe in Bash, to obtain the number of unique strings in each field.

I would really appreciate any help or hint!

Thanks!

Mic

I understand the motivation behind doing something like this in Bash, or Python, or Perl, but … C++? really? — BRPocock, Jan 22 '14 at 17:50
See http://stackoverflow.com/questions/13648410/how-can-i-get-unique-values-from-an-array-in-linux-bash and `vontrapp`'s answer for how to do this in bash. — Reinstate Monica Please, Jan 22 '14 at 19:22
If, by your definition, `A/B/C/D` is a "string", and (I'm guessing, here) `A`, `B`, `C` and `D` are "fields", then I'm a little unclear on the concept of "number of unique strings per each field"... Do you mean the number of fields in each string? Or the number of different strings a particular field value appears in? Or something else? What does "unique" have to do with it? Unique fields or unique strings? — twalberg, Jan 22 '14 at 19:28

wnnmaw · Answer 1 · 2014-01-23T15:37:40.350

Assuming the strings are always delimited by /, here's how I would do it in Python (the code below uses Python 2 print statements):

start1 = "A/B/C/D"
start2 = "B/D/E/A/B"
start3 = "D/A/A/B/D/C"
start4 = "C"

startList = [start1, start2, start3, start4]
print "startList: ", startList
fields = []

for start in startList:
    for field in start.split('/'):
        fields.append(field)

print "fields: ", fields

countDict = dict.fromkeys(fields)
print "countDict 1: ", countDict

for entry in countDict.keys():
    countDict[entry] = fields.count(entry)

print "countDict 2: ", countDict

Here is what the print statements output:

startList:  ['A/B/C/D', 'B/D/E/A/B', 'D/A/A/B/D/C', 'C']
fields:  ['A', 'B', 'C', 'D', 'B', 'D', 'E', 'A', 'B', 'D', 'A', 'A', 'B', 'D', 'C', 'C']
countDict 1:  {'A': None, 'C': None, 'B': None, 'E': None, 'D': None}
countDict 2:  {'A': 4, 'C': 3, 'B': 4, 'E': 1, 'D': 4}

However, if the starting string is giant (millions of entries) and speed really matters, Python is probably not your best choice. It's easy to learn and very readable (and my favorite language), but it's just not as fast as compiled languages like C. That being said, it's fast enough for the vast majority of applications.

A note on this particular method: there are plenty of 'fancier' ways to count the entries in a list. Many are faster and more "pythonic", but this should suffice for your purposes. If you want to see these methods, just do a quick search around the site. If anything in this method is unclear, let me know. Hope this helps!
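One of those 'fancier' ways, as a minimal sketch: `collections.Counter` from the standard library tallies every field in a single pass, where the loop above rescans the list once per key with `fields.count`. The snippet is written for Python 3 (note `print()` as a function), and the names `start_list` and `counts` are illustrative only.

```python
from collections import Counter

# Same sample strings as in the code above.
start_list = ["A/B/C/D", "B/D/E/A/B", "D/A/A/B/D/C", "C"]

# Split every string on "/" and tally each field value as it streams past.
counts = Counter(field for start in start_list for field in start.split("/"))

# Print the tallies in alphabetical order of the field value.
for field, n in sorted(counts.items()):
    print(field, n)
```

The resulting counts are the same as in countDict 2 above; only the way they are collected changes.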

If what you want is the number of unique entries in each string, this is what you're looking for:

start1 = "A/B/C/D"
start2 = "B/D/E/A/B"
start3 = "D/A/A/B/D/C"
start4 = "C"

startList = [start1, start2, start3, start4]
print "startList: ", startList

countDict = dict.fromkeys(startList)
print "countDict 1: ", countDict

for start in startList:
    countDict[start] = len(set(start.split('/')))

print "countDict 2: ", countDict

Here is what the print statements output:

startList:  ['A/B/C/D', 'B/D/E/A/B', 'D/A/A/B/D/C', 'C']
countDict 1:  {'B/D/E/A/B': None, 'A/B/C/D': None, 'C': None, 'D/A/A/B/D/C': None}
countDict 2:  {'B/D/E/A/B': 4, 'A/B/C/D': 4, 'C': 1, 'D/A/A/B/D/C': 4}

Thanks for the answer. The problem is that I have to analyze the number of unique strings in each field (keeping track of the field itself) for a bunch of strings, and not inside the same string. Let's say: string_1 = A/B/C/D; string_2 = A/B/F/G. So the result would be: Field_1 = 1; Field_2 = 1; Field_3 = 2; Field_4 = 2. — Estoque'm', Jan 23 '14 at 10:00
Oh, got it, sorry I misunderstood your question, I'll update in a bit — wnnmaw, Jan 23 '14 at 12:08
thanks! Now it's much closer to what I want. If I'm not wrong, your countDict 1 expresses the uniqueness of the 'A' character in the first field, that of the 'C' in the second field, and so on. However, my ideal result would be, according to the strings you wrote: Field_1 = 4 different characters; Field_2 = 3; Field_3 = 3; Field_4 = 3; Field_5 = 2; Field_6 = 1. Hope to be clear. Thanks a lot again! — Estoque'm', Jan 23 '14 at 15:17
@user3224616, if that's what you want, then my first answer was closer, I'll update again — wnnmaw, Jan 23 '14 at 15:33
OK, never mind... maybe I cannot clearly explain what I want. For me, always according to your strings, the longest string is 'start3', which has 6 fields. OK, now, for each field, I want to do an overall analysis that also considers the fields of the other strings. In this particular case, we do not have duplicates in any field, so the number of unique words in each field will be equal to the number of occurrences of each field, as I wrote in the previous comment. — Estoque'm', Jan 23 '14 at 16:05
@user3224616 This is the importance of posting a clear question. Please update your original question to include definitions of important terms (such as "field"), example input (which you already have), and expected output — wnnmaw, Jan 23 '14 at 16:11
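
Going by the clarification in the comments above (`A/B/C/D` and `A/B/F/G` should give Field_1 = 1, Field_2 = 1, Field_3 = 2, Field_4 = 2), the request seems to be the number of distinct values at each field position, taken across all strings. A minimal Python 3 sketch under that reading follows. The function name `unique_per_field` is hypothetical, and strings shorter than a given position are assumed simply not to contribute to it.

```python
from collections import defaultdict

def unique_per_field(strings, sep="/"):
    """Return {field position: number of distinct values seen at that position}."""
    seen = defaultdict(set)  # position (1-based) -> set of values found there
    for s in strings:
        for pos, value in enumerate(s.split(sep), start=1):
            seen[pos].add(value)
    return {pos: len(values) for pos, values in sorted(seen.items())}

# Sample strings taken from the answer above.
strings = ["A/B/C/D", "B/D/E/A/B", "D/A/A/B/D/C", "C"]
for pos, n in unique_per_field(strings).items():
    print("Field_%d = %d" % (pos, n))
```

For the four sample strings this gives 4, 3, 3, 3, 2, 1 for fields 1 to 6, which matches the figures asked for in the comments.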

score 0 · Answer 2 · answered Jan 22 '14 at 17:49

If you're concerned with individual parts:

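for n in 1 2 3 4 5 6 7
do
    echo "for field # $n, unique values:"
    # awk skips lines with fewer than $n fields; plain cut would print an
    # empty value for them, or the whole line if it contains no "/"
    awk -F/ -v n="$n" 'NF >= n { print $n }' collection-of-strings | sort | uniq -c
done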

If you're looking at URI-type prefixes:

for n in 1 2 3 4 5 6 7
do
    echo "for fields # 1…$n, unique prefices:"
    cut -d / -f 1-$n collection-of-strings | sort | uniq -c
done

This assumes you have no more than 7 fields; adjust the for loop accordingly if there are longer strings.
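
If the fixed upper bound of 7 is a concern, the prefix count can also be sketched in Python 3 so that the depth follows the data. This is only an illustration: the function name `prefix_counts` is made up, and `lines` is assumed to be any iterable of strings (an open file would do). Unlike `cut -f 1-$n`, which passes a shorter line through whole, the sketch counts a prefix only at depths the line actually reaches.

```python
from collections import Counter, defaultdict

def prefix_counts(lines, sep="/"):
    """Map each depth n to a Counter of the prefixes made of the first n fields."""
    counts = defaultdict(Counter)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        parts = line.split(sep)
        for depth in range(1, len(parts) + 1):
            counts[depth][sep.join(parts[:depth])] += 1
    return counts

# Small made-up sample; replace with open("collection-of-strings") for a file.
sample = ["A/B/C", "A/B/D", "A/X"]
for depth, counter in sorted(prefix_counts(sample).items()):
    print("for fields # 1..%d, unique prefixes:" % depth)
    for prefix, n in sorted(counter.items()):
        print("%7d %s" % (n, prefix))
```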

score 0 · Answer 3 · edited May 23 '17 at 12:03

Given the "URL" bit, I'm assuming you mean counting unique components and not the number of words within each component. I probably wouldn't use bash for this, for simplicity's sake, but if I had to, I would do something like this:

Check that input contains /
```
[[ $input == *"/"* ]]
```
Check that input does not contain whitespace characters
```
[[ $input != *[[:space:]]* ]]
```

Set the internal field separator (IFS) to /

IFS="/" #Note you are doing this in a shell script and not directly in a shell

Make an array out of the input.
```
arr=($input)
```
Make the array unique. See https://stackoverflow.com/a/17758600/3076724 for what is probably the simplest answer.

Then print number of components/do something with each

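echo "Number of components in $input = ${#arr[@]}"
for i in "${arr[@]}"; do
  # Do something with each component "$i"; the loop body must contain at
  # least one command, so print the component as a placeholder
  echo "$i"
done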

That should get you started, and you can easily join them together to make a working shell script.
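
For comparison, since bash may not be the simplest tool here, the same steps can be sketched in Python 3. The function name `unique_components` is hypothetical, and raising `ValueError` on bad input is an assumption about how the two checks should fail.

```python
def unique_components(text, sep="/"):
    """Validate text, split it on sep, and return its components without duplicates."""
    # Check that the input contains the separator.
    if sep not in text:
        raise ValueError("input contains no %r" % sep)
    # Check that the input does not contain whitespace characters.
    if any(ch.isspace() for ch in text):
        raise ValueError("input contains whitespace")
    # dict.fromkeys drops duplicates while keeping first-seen order (Python 3.7+).
    return list(dict.fromkeys(text.split(sep)))

components = unique_components("B/D/E/A/B")
print("Number of unique components = %d" % len(components))
for component in components:
    print(component)
```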
