
I have a list of URLs named urls.list:

https://target.com/?first=one
https://target.com/something/?first=one
http://target.com/dir/?first=summer
https://fake.com/?first=spring
https://example.com/about/?third=three
https://example.com/?third=three

and I want to make them unique based on their domain including the protocol, such as https://target.com. That is, the first URL for each protocol-and-domain pair should be printed once and any later URLs for the same pair should be skipped. So the result would be:

https://target.com/?first=one
http://target.com/dir/?first=summer
https://fake.com/?first=spring
https://example.com/about/?third=three

This is what I tried to do:

cat urls.list | cut -d"/" -f1-3 | awk '!a[$0]++' >> host_unique.del

for urls in $(cat urls.list); do

    for hosts in $(cat host_unique.del); do
        if [[ $hosts == *"$urls"* ]]; then
            echo "$hosts"
        fi
    done
done
sof31
    Tangentially, [don't read lines with `for`](http://mywiki.wooledge.org/DontReadLinesWithFor) and avoid the [useless use of `cat`](https://stackoverflow.com/questions/11710552/useless-use-of-cat) – tripleee May 24 '21 at 05:44

2 Answers


This awk command might do what you want.

awk -F'/' '!seen[$1,$3]++' urls.list
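To see why `$1,$3` works as the key, it may help to look at how awk splits a URL on `/`. The sketch below wraps the same one-liner in a function and adds a second one that prints the key for each line; the function names `unique_by_host` and `show_keys` are made up for illustration, and the demo input is a subset of the question's sample.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first URL seen for each scheme+host pair.
# With "/" as the separator, "https://target.com/dir/" splits into
# $1="https:", $2="" (between the two slashes), $3="target.com", $4="dir".
unique_by_host() {
  awk -F'/' '!seen[$1,$3]++' "$@"
}

# Hypothetical helper: show the scheme+host key awk sees for each line.
show_keys() {
  awk -F'/' '{ print $1 "//" $3 }' "$@"
}

# Demo on a few URLs from the question, fed on standard input.
printf '%s\n' \
  'https://target.com/?first=one' \
  'https://target.com/something/?first=one' \
  'http://target.com/dir/?first=summer' | unique_by_host
```

Because the scheme is part of the key, the http and https variants of the same host are kept as separate entries, which is what the question asks for.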

A pure bash alternative will be very slow on large inputs, but here it is.

It uses mapfile (also known as readarray) and an associative array, both of which require bash 4 or later, plus a few other bash features.

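#!/usr/bin/env bash

# Associative array of scheme/host keys already printed (bash 4+).
declare -A uniq
# Read urls.list into the urls array, one line per element.
mapfile -t urls < urls.list

for uniq_url in "${urls[@]}"; do
  # Split on "/": url[0] is the scheme ("https:"), url[2] is the host.
  IFS='/' read -ra url <<< "$uniq_url"
  # Print the URL only the first time its scheme/host pair is seen.
  if ((!uniq["${url[0]}","${url[2]}"]++)); then
    printf '%s\n' "$uniq_url"
  fi
done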
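If loading the whole file into an array is a concern, a streaming variant is possible. The following is only a sketch under a few assumptions: the function name `unique_by_host_bash` is hypothetical, it reads URLs from standard input, and it extracts the scheme and host with parameter expansion rather than `read -ra`. It still needs bash 4 or later for the associative array.

```shell
#!/usr/bin/env bash
# Hypothetical function: print the first URL seen for each scheme+host pair,
# reading one line at a time from standard input.
unique_by_host_bash() {
  local -A seen=()
  local line rest scheme host key
  # The "|| [[ -n $line ]]" part also handles a last line without a newline.
  while IFS= read -r line || [[ -n $line ]]; do
    scheme=${line%%://*}   # text before "://", e.g. "https"
    rest=${line#*://}      # text after "://"
    host=${rest%%/*}       # up to the first "/", e.g. "target.com"
    key="$scheme://$host"
    # Print the line only if this key has not been recorded yet.
    if [[ -z ${seen[$key]+x} ]]; then
      seen[$key]=1
      printf '%s\n' "$line"
    fi
  done
}

# Demo on a few URLs from the question.
unique_by_host_bash <<'EOF'
https://target.com/?first=one
https://target.com/something/?first=one
http://target.com/dir/?first=summer
EOF
```

With a real file it would be called as `unique_by_host_bash < urls.list`. It will still be much slower than awk on large inputs.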
Jetchisel

With the samples you have shown, please try the following.

awk 'match($0,/https?:\/\/[^/]*/){val=substr($0,RSTART,RLENGTH)} !arr[val]++' Input_file

Explanation: a detailed, commented version of the command above.

awk '                               ##Start the awk program.
match($0,/https?:\/\/[^/]*/){       ##Match http:// or https:// followed by the host (everything up to the next /).
  val=substr($0,RSTART,RLENGTH)     ##Store the matched scheme and host in val.
}
!arr[val]++                         ##Print the current line only if val has not been seen in arr before.
' Input_file                        ##Name of the input file.
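One thing to be aware of: when a line does not match, val keeps the value from the previous matching line, so a non-URL line that follows a URL is treated as a duplicate of that URL's host and dropped. That does not matter for the shown samples. If the input might contain such lines, a possible variant (a sketch only; the function name `unique_by_match` is hypothetical) falls back to the whole line as the key:

```shell
#!/usr/bin/env bash
# Hypothetical helper: dedupe by scheme+host, using the whole line as the key
# when the line does not start with http:// or https://.
unique_by_match() {
  awk '{
    # match() sets RSTART and RLENGTH when the URL prefix is found.
    key = match($0, /^https?:\/\/[^\/]*/) ? substr($0, RSTART, RLENGTH) : $0
  }
  !seen[key]++' "$@"
}

# Demo: the "not a url" line is kept once instead of being dropped.
printf '%s\n' \
  'https://target.com/?first=one' \
  'not a url' \
  'https://target.com/something/?first=one' | unique_by_match
```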
RavinderSingh13