Extract a specific string from a text file and create an HTTP request with the extracted string

Question

I'm trying to extract a specific string value from a text file (file1.txt),
then make an HTTP GET request with the extracted string (a URL).
The HTTP response should be saved as a new HTML file in the directory.

The string I'm trying to extract is the value of a specific key. For example: "display_url":"test.com" (extract "test.com" and then make an HTTP request to it).

file1.txt can contain multiple instances of display_url, since it is in a list under urls. If there is more than one value, I want to make an HTTP request for each of them.

My txt file content:

{"created_at":"Thu Nov 15 11:35:00 +0000 2018","id":15292802,"id_str":325802","text":"test8 https://test/ZtCsuk7Ek2 #osining","source":"\u003ca href=\"http://twitter.com\" rel=\"nofollow\"\u003eTwitter Web Client\u003c/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":961508561217052675,"id_str":"961508561217052675","name":"Online S","screen_name":"osectraining","location":"Israel","url":"https://www.test.co.il","description":"test","translator_type":"none","protected":false,"verified":false,"followers_count":2,"friends_count":51,"listed_count":0,"favourites_count":0,"statuses_count":7,"created_at":"Thu Feb 08 07:54:39 +0000 2018","utc_offset":null,"time_zone":null,"geo_enabled":false,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"000000","profile_background_image_url":"http://abs.twimg.com/images/themes/theme1/bg.png","profile_background_image_url_https":"https://abs.twimg.com/images/themes/theme1/bg.png","profile_background_tile":false,"profile_link_color":"1B95E0","profile_sidebar_border_color":"000000","profile_sidebar_fill_color":"000000","profile_text_color":"000000","profile_use_background_image":false,"profile_image_url":"http://pbs.twimg.com/profile_images/961510231346958336/d_KhBeTD_normal.jpg","profile_image_url_https":"https://pbs.twimg.com/profile_images/961510231346958336/d_KhBeTD_normal.jpg","profile_banner_url":"https://pbs.twimg.com/profile_banners/961508561217052675/1518076913","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"quote_count":0,"reply_count":0,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[{"text":"osectraining","indices":[33,46]}],"urls":[{"url":"https://test/ZtCsuk7Ek2","expanded_ur
l":"http://test.com","display_url":"test.com","indices":[7,30]}],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en","timestamp_ms":"1542281700508"}

The structure of your file content implies that there could be multiple instances of `display_url`, since it is in a list under `urls`. What should happen if multiple are found? — lxop, Nov 21 '18 at 09:17
You are asking multiple questions at once. It would be wise to split up your question. — Micha Wiedenmann, Nov 21 '18 at 09:19
This talks about parsing JSON with BASH: https://stackoverflow.com/questions/1955505/parsing-json-with-unix-tools — Micha Wiedenmann, Nov 21 '18 at 09:19
Also, your txt file has a bonus `"` in it at the end of the `id_str` value — lxop, Nov 21 '18 at 09:24
Does your txt file contain valid JSON data? Your example is not valid JSON even though it looks like JSON. – cn007b, Nov 21 '18 at 12:52

score 0 · Accepted Answer · answered Nov 21 '18 at 13:42

1) Your file does not look like valid JSON, so for step 1 you have to extract the value with a text tool, something like this:

url=$(grep -oP '(?<=display_url":")[^"]+' /tmp/x.txt)
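
The `-P` option (Perl-style regex, needed for the lookbehind) is a GNU grep feature and may not be available everywhere. As a rough sketch of the same extraction without it, assuming the key always appears exactly as `"display_url":"..."` with no spaces around the colon, you could match the whole key/value pair and cut out the value. The function name `extract_display_urls` and the sample file contents are made up for illustration:

```shell
# Hypothetical helper: print every display_url value found in file "$1",
# one per line. Matches the whole "display_url":"value" pair, then takes
# the 4th field when splitting on double quotes (the value itself).
extract_display_urls() {
    grep -o '"display_url":"[^"]*"' "$1" | cut -d'"' -f4
}

# Try it on a throwaway sample file (contents invented for this example).
sample=$(mktemp)
printf '%s\n' '{"urls":[{"display_url":"test.com"},{"display_url":"example.org"}]}' > "$sample"
urls=$(extract_display_urls "$sample")
printf '%s\n' "$urls"
rm -f "$sample"
```

Because this is plain text matching, it does not care whether the rest of the file is valid JSON.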

2 & 3) Now you can do something like this:

curl "$url" -o /tmp/x.html

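If there is more than one display_url, you have to use a loop, like this:

display_urls=$(grep -oP '(?<=display_url":")[^"]+' /tmp/x.txt)
for url in $display_urls; do
    # -o (lowercase) writes the response to the given file
    curl "$url" -o "/tmp/$url.html"
done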
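
One thing to watch out for: a URL that contains `/` or `?` cannot be used directly as a file name, and curl needs the lowercase `-o` to write to a named file. Here is a minimal sketch that takes care of both. The helper names `url_to_filename` and `fetch_all` are hypothetical, and the curl options (`-sSfL`: quiet, show errors, fail on HTTP errors, follow redirects) are just one reasonable choice, not a requirement:

```shell
# Hypothetical helper: turn a URL into a safe file name by replacing every
# character other than letters, digits, '.', '_' and '-' with '_'.
url_to_filename() {
    printf '%s' "$1" | tr -c 'A-Za-z0-9._-' '_'
}

# Hypothetical helper: read URLs from stdin (one per line) and save each
# response as <out_dir>/<safe name>.html. Failures are reported on stderr
# and do not stop the loop.
fetch_all() {
    out_dir=$1
    while IFS= read -r url; do
        [ -n "$url" ] || continue
        # </dev/null keeps curl from consuming the URL list on stdin
        curl -sSfL "$url" -o "$out_dir/$(url_to_filename "$url").html" </dev/null ||
            printf 'failed: %s\n' "$url" >&2
    done
}
```

You would feed it the grep output, for example: `grep -oP '(?<=display_url":")[^"]+' /tmp/x.txt | fetch_all /tmp`. Since display_url has no scheme, curl falls back to `http://` for it.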

cn007b


And what if the file is in append mode? (I mean new content is added to the file every so often.) – bugnet17 Nov 21 '18 at 16:29
In this case use: `curl $url >> /tmp/x.html`. – cn007b Nov 21 '18 at 16:44
In this case you need a loop that traverses the array obtained from `grep -oP '(?<=display_url":")[^"]+'`. – cn007b Nov 21 '18 at 17:02
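
For a file that keeps growing, one possible approach (only a sketch, not something from the answer itself) is to re-scan the file on every run and remember which URLs were already fetched. The function name `new_urls` and the "seen" file are assumptions made up for this example, and the extraction assumes the key appears exactly as `"display_url":"..."`:

```shell
# Hypothetical helper: print the display_url values from data file "$1"
# that are not yet listed in the "seen" file "$2" (one URL per line).
new_urls() {
    touch "$2"    # make sure the seen file exists
    grep -o '"display_url":"[^"]*"' "$1" | cut -d'"' -f4 | sort -u |
    while IFS= read -r url; do
        # -F fixed string, -x whole line, -q quiet: exact membership test
        if ! grep -Fxq -- "$url" "$2"; then
            printf '%s\n' "$url"
        fi
    done
}

# Possible use, run periodically (paths are placeholders):
#   new_urls /tmp/x.txt /tmp/seen.txt | while IFS= read -r url; do
#       curl -sS "$url" >> /tmp/x.html </dev/null && printf '%s\n' "$url" >> /tmp/seen.txt
#   done
```

A URL is only recorded as seen after curl succeeds, so a failed request is retried on the next run.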
