1

I have a list of URLs in different formats that were extracted from a random website:

http://www.w3.org/2000/svg http://www.w3.org/1999/xlink    
/bg-images/png/search-magnifying-glass.png    
http://www.boston.com/weather?p1=BGMenu_SubnavBostonGlobe.com    
http://www.w3.org/2000/svg 
http://www.w3.org/1999/xlink    
/bg-images/png/search-magnifying-glass.png http://www.w3.org/2000/svg
http://www.w3.org/1999/xlink 
/bg-images/png/bg-logo--full.png            
http://www.w3.org/2000/svg 
http://www.w3.org/1999/xlink    
/bg-images/png/bg-logo--bug.png 
https://www.bostonglobe.com    
https://www.bostonglobe.com    
/metro/2018/06/18/sjc-ruling-millionaires-tax-coming-monday/unxBjYa0JGHKfMKUBzsMjO/story.html?p1=BGHeader_SmartBar_Breaking        
/metro/2018/06/18/sjc-ruling-millionaires-tax-coming-monday/unxBjYa0JGHKfMKUBzsMjO/story.html?p1=BGHeader_SmartBar_Breaking    
http://www.w3.org/1999/xlink /bg-images/png/bg-logo-large--full.png    
http://www.boston.com/section/cars?s_campaign=bg:hp:mainnav:cars    
http://realestate.boston.com?s_campaign=bg:hp:mainnav:realestate    
http://www.w3.org/2000/svg http://www.w3.org/1999/xlink

They are all in different formats (optional http/https/www). I need to filter the list to get any kind of "downloadable" content such as *jpg, *png, *html, etc.

Expected output:

/bg-images/png/search-magnifying-glass.png      
/bg-images/png/search-magnifying-glass.png 
/bg-images/png/bg-logo--full.png                
/bg-images/png/bg-logo--bug.png     
/metro/2018/06/18/sjc-ruling-millionaires-tax-coming-monday/unxBjYa0JGHKfMKUBzsMjO/story.html?p1=BGHeader_SmartBar_Breaking        
/metro/2018/06/18/sjc-ruling-millionaires-tax-coming-monday/unxBjYa0JGHKfMKUBzsMjO/story.html?p1=BGHeader_SmartBar_Breaking  (not sure about these yet just in case)  
http://www.w3.org/1999/xlink /bg-images/png/bg-logo-large--full.png    

This is my first time trying to write a regex, and I came up with something like this: (https?/\/)?(www\.)?[-a-zA-Z0-9@:;%._\+~\/#=]{2,256}\.[a-z]{2,4}a{0,1}\b([-a-zA-Z0-9@:;!%_\+.,~#?&//=]*)

It outputs a lot of junk lines. Any advice?

Idriss Neumann
  • Why don't you consider `https://www.bostonglobe.com` to be "downloadable content"? What are your criteria? – glenn jackman Jun 18 '18 at 18:52
  • Umm, there are a ton of URIs that do not have a suffix like *.png. An image URL can be pretty much anything. For example, many REST services use arguments to identify an image: http://foo.com/image?id=123456 – eocron Jun 18 '18 at 18:56
  • Is that the actual text file you have to extract data from, or is it jacked when you posted it ? –  Jun 18 '18 at 19:21
  • Normally you'd use a modified URL validator, and capture the path. In this case it's in capture group 1. `(?m)^(?!mailto:)(?:(?:https?|ftp):\/\/)?(?:\S+(?::\S*)?@)?(?:(?:(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))|localhost)(?::\d{2,5})?(\/[^\s]+)$` –  Jun 18 '18 at 19:25
  • @sln it can be any html file, doesn't depend on that. – Igor Kamalov Jun 18 '18 at 21:46
  • @glennjackman I wish I could use that; the goal is to use grep, awk, sed, etc. – Igor Kamalov Jun 18 '18 at 21:47

2 Answers

1

Since the lines in your sample Input_file have trailing spaces, I am using sub to remove them; if your real input has no trailing spaces, you can drop that part. Could you please try the following and let me know if it helps?

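awk '
# Strip trailing spaces, then print lines that start like a URL or path
# and end in a wanted extension (or the known query-string suffix).
{ sub(/ +$/, "") }
/^(https?:\/\/|www\.|\/)/ && /(\.(png|jpg|html)|BGHeader_SmartBar_Breaking)$/
' Input_file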
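
A possible variation on the same idea, offered only as a sketch: instead of hard-coding the `BGHeader_SmartBar_Breaking` suffix, it ignores everything from the first `?` or `#` onward before testing the extension. The function name `filter_by_extension` and the extension list are assumptions chosen for illustration, not something taken from the question.

```shell
# Hypothetical helper: print lines whose path part ends in a wanted extension.
# Reads the named files, or standard input when no file is given.
filter_by_extension() {
  awk '
    {
      sub(/[ \t\r]+$/, "")       # drop trailing whitespace from the line
      path = $0
      sub(/[?#].*$/, "", path)   # ignore query string and fragment when testing
    }
    # The extension list below is an assumption; adjust it to your criteria.
    tolower(path) ~ /\.(png|jpe?g|gif|html?|css|js)$/
  ' "$@"
}

# Usage: filter_by_extension Input_file
```

Because only the test is done on the stripped copy, the matching lines are printed with their query strings intact. Lines such as `https://www.bostonglobe.com` are dropped, since `.com` is not in the list.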
RavinderSingh13
0

Instead of trusting questionable URLs from a questionable feed, you need to check each of them, because a URL in general does NOT contain information about its content. Many storage services use IDs to identify images, not names with extensions. The response headers, however, do contain this information:

How to get content type of a web address?

So what is downloadable? Everything. I mean literally everything you see is downloadable. For example, for images the content types will be something like these:

image/gif, image/png, image/jpeg, image/bmp, image/webp

For audio/video:

audio/midi, audio/mpeg, audio/webm, audio/ogg, audio/wav

A partial list can be found here: http://htmlbook.ru/html/value/mime

As for a solution: just probe every link from multiple I/O threads. This way you will also be able to filter out links that need authentication, have expired, or were invalid in the first place. These requests are usually pretty cheap.
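
The probing step could look roughly like the sketch below, which assumes `curl` is available and that the server answers HEAD requests with a meaningful `Content-Type`. The function names `is_resource_type` and `content_type_of`, the file name `urls.txt`, and the list of accepted types are all assumptions made for illustration.

```shell
# Hypothetical helper: succeed if a Content-Type value looks like a page resource.
# The accepted types are an assumption; extend the list as needed.
is_resource_type() {
  case "$1" in
    image/*|audio/*|video/*) return 0 ;;
    text/html*|text/css*|text/javascript*|application/javascript*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: print the Content-Type reported for a URL.
# Sends a HEAD request (-I), follows redirects (-L), and discards the headers.
content_type_of() {
  curl -sIL -o /dev/null -w '%{content_type}' --max-time 10 "$1"
}

# Usage (needs network access; urls.txt holds one absolute URL per line):
#   while read -r url; do
#     ct=$(content_type_of "$url") && is_resource_type "$ct" &&
#       printf '%s\t%s\n' "$url" "$ct"
#   done < urls.txt
```

Relative paths such as `/bg-images/png/bg-logo--bug.png` would first have to be joined with the site's base URL, and running the requests in parallel (for example with `xargs -P`) is left out to keep the sketch short.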

eocron
  • Thank you so much for your help. I'm actually looking to select all the dependent resources for the website. I'm still in the process of establishing the criteria for those. It doesn't necessarily have to be everything: mostly CSS, JavaScript, images, and stuff loaded from JS. – Igor Kamalov Jun 18 '18 at 21:45