I am trying to extract and clean the domain names from a list of URLs. I have read the post
"How to extract domain name from url?"
and so far I can do this:

$ URI="http://user:pw@example.com:80/"
$ echo "$URI" | sed -e 's/[^/]*\/\/\([^@]*@\)\?\([^:/]*\).*/\2/'
example.com

But my list of URLs also contains entries like these:

example1.comDNT:
example2.comContent-Length:

I want the output to look like this:

example1.com
example2.com

Can I use Python to solve this?
Any advice would be appreciated.
Thanks a lot.

cacalun12
  • This can't be done with 100% accuracy. What is the domain name in `example.coM` for example -- is it `example.co` or `example.com`? The best you can do could be to download the [public suffix list](https://wiki.mozilla.org/Public_Suffix_List) and make an attempt to guess the TLD. – Selcuk Mar 10 '22 at 06:54
  • @Selcuk The domain name format is like this: `example1.comDNT:`, and some are like this: `example2.comContent-Length:` – cacalun12 Mar 10 '22 at 06:57
  • Try `IFS=/ read -r _ _ dom _ <<<"$URI" ;dom=${dom%:*} dom=${dom#*@}; echo $dom` and have a look at [How do I split a string on a delimiter in Bash?](https://stackoverflow.com/a/15988793/1765658) – F. Hauri - Give Up GitHub Mar 10 '22 at 07:13
  • @Selcuk Is `example1.comDNT:` in your *source* list, or is it the result of your script? – F. Hauri - Give Up GitHub Mar 10 '22 at 07:16
  • Finally, you could `echo ${dom%%+([A-Z])*}` (see `shopt -s extglob` if needed). This deletes everything from the first uppercase letter to the end of the string. But this is far from perfect. See [@Selcuk comment](https://stackoverflow.com/questions/71419919/clean-and-extract-domain-name-from-url#comment126237489_71419919). – F. Hauri - Give Up GitHub Mar 10 '22 at 07:26
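
Since the question asks about Python, the idea from the comments (cut the glued-on text starting at an uppercase letter, then parse the host) could be sketched roughly as below. This is only a heuristic under stated assumptions: `clean_domain` is a hypothetical helper name, and the sketch assumes the trailing junk always looks like an HTTP header name, i.e. it starts with an uppercase letter and ends with a colon at the end of the string. As noted above, it cannot resolve ambiguous cases such as `example.coM`.

```python
import re
from urllib.parse import urlsplit

# Assumption: the junk glued onto the domain is a header name such as
# "DNT:" or "Content-Length:", i.e. an uppercase letter followed by
# letters or hyphens and a colon at the very end of the string.
_HEADER_TAIL = re.compile(r"[A-Z][A-Za-z-]*:$")


def clean_domain(raw):
    """Return the host part of a URL-like string, or "" if none is found."""
    text = _HEADER_TAIL.sub("", raw.strip())
    # urlsplit only recognises a host when the string contains "//".
    if "//" not in text:
        text = "//" + text
    # .hostname drops user:password@ and the port, and lowercases the host.
    return urlsplit(text).hostname or ""


for item in ("http://user:pw@example.com:80/",
             "example1.comDNT:",
             "example2.comContent-Length:"):
    print(clean_domain(item))
```

For the three inputs from the question this prints `example.com`, `example1.com` and `example2.com`.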

1 Answer

Could you try this:

echo "$URI" | awk -F'http://user:pw@' '{print $2}' | sed 's/\.com.*/.com/'
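
Note that the pipeline above only works for this exact `http://user:pw@` prefix and for `.com` domains. A rough Python sketch of the same "cut after the suffix" idea, extended to a few suffixes, might look like the following. `KNOWN_SUFFIXES` and `domain_from` are hypothetical names, and the suffix tuple is a deliberately tiny stand-in; a real list would have to come from something like the public suffix list mentioned in the comments.

```python
import re

# Hypothetical, deliberately tiny suffix list for illustration only.
KNOWN_SUFFIXES = ("com", "org", "net", "co.uk")

# Try longer suffixes first so "co.uk" is tested before shorter ones.
_SUFFIX_ALTERNATION = "|".join(
    re.escape(s) for s in sorted(KNOWN_SUFFIXES, key=len, reverse=True)
)

_DOMAIN = re.compile(
    r"^(?:[A-Za-z][A-Za-z0-9+.-]*://)?"   # optional scheme, e.g. http://
    r"(?:[^/@]*@)?"                        # optional user:password@
    r"([a-z0-9.-]+?\.(?:" + _SUFFIX_ALTERNATION + r"))"
    r"(?![a-z0-9.-])"                      # suffix must not continue in lowercase
)


def domain_from(raw):
    """Return the lowercase domain ending in a known suffix, or None."""
    match = _DOMAIN.match(raw.strip())
    return match.group(1) if match else None


print(domain_from("http://user:pw@example.com:80/"))  # example.com
print(domain_from("example1.comDNT:"))                # example1.com
print(domain_from("example2.comContent-Length:"))     # example2.com
```

The negative lookahead keeps a name like `example.community.org` from being cut at `.com`, but the sketch still assumes the domain itself is all lowercase and that the glued-on text starts with an uppercase letter.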
m477so