
I have already read many pages here on Stack Overflow, but nothing works for my scenario.

I want to get the last matching URL (or all matching URLs) containing "cedock" from this website: "https://4pda.ru/forum/index.php?showtopic=973246&st=4040#Spoil-97613600-1"

When I save the file and then search it in my text editor, it works fine, but none of these commands worked for me to extract the URLs or filter anything from this page:

curl -k -s "https://4pda.ru/forum/index.php?showtopic=973246&st=4040#Spoil-97613600-1" | awk -F'SRC="|"' '/SRC/ && /'"cedock"'/  {print $4}'

curl -k -s "https://4pda.ru/forum/index.php?showtopic=973246&st=4040#Spoil-97613600-1" | grep -o 'FOTA-OTA/V8-R851T02-LF1V342.014883.zip.*zip</a><br /></div></div><br'

grep "<a href=" 4pda.txt |sed "s/<a href/\\n<a href/g" |sed 's/\"/\"><\/a>\n/2' |grep href |sort |uniq

Is something broken with the website itself? I am using similar commands on other websites and they work there.

The desired output is the latest download URL from cedock, so for example right now: http://na-update.cedock.com/apps/resource2/V8R851T02/V8-R851T02-LF1V351/FOTA-OTA/V8-R851T02-LF1V351.015103.zip
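For reference, a plain-grep approach can pull such links from a saved copy of the page without an HTML parser. This is a minimal sketch; the sample HTML below is a made-up stand-in for the real markup (the real saved file would be something like 4pda.txt):

```shell
# Stand-in sample for the saved page; the real page has the same href shape.
cat > sample.html <<'EOF'
<div><a href="http://na-update.cedock.com/apps/resource2/V8R851T02/V8-R851T02-LF1V342/FOTA-OTA/V8-R851T02-LF1V342.014883.zip">old</a><br />
<a href="http://na-update.cedock.com/apps/resource2/V8R851T02/V8-R851T02-LF1V351/FOTA-OTA/V8-R851T02-LF1V351.015103.zip">new</a></div>
EOF

# Extract every cedock .zip URL ([^"]* keeps the match inside the href
# attribute value), then keep only the last one.
grep -Eo 'http://[^"]*cedock[^"]*\.zip' sample.html | tail -n 1
```

This prints the second (latest) URL from the sample. It is brittle against markup changes, which is why an HTML-aware tool is still preferable.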

FaserF
    [Don't Parse XML/HTML With Regex.](https://stackoverflow.com/a/1732454/3776858) I suggest to use an XML/HTML parser (xmlstarlet, xmllint ...). – Cyrus Jul 11 '20 at 09:17
  • 3
    C'mon, investigate! Try the curl command, and look at the output. Or check the output for your keyword: `curl -s https://.... | grep cedock`. You'll get nothing. So, think about why that could be? What is different between curl and your browser? Javascript! Page content must be generated by JS in the browser, that does not happen using curl. – Don't Panic Jul 11 '20 at 09:18
  • 1
    @Don'tPanic that should be posted as an answer – alecxs Jul 11 '20 at 09:24
  • @Cyrus xmlstarlet can only read XML, and the output of this page is not XML..... – Luuk Jul 11 '20 at 09:39
  • @Don'tPanic thanks for the info, indeed I hadn't thought about JS with curl/wget. But the curl output saved as a text file does contain the needed URL lines, so JS doesn't seem to be the issue with my problem here, and so not the solution? – FaserF Jul 11 '20 at 14:28
  • I guess we are getting different results then. As I wrote, I see nothing matching `cedock`. – Don't Panic Jul 12 '20 at 00:42

1 Answer


With xmlstarlet:

curl -k -s 'https://4pda.ru/forum/index.php?showtopic=973246&st=4040#Spoil-97613600-1' \
  | xmlstarlet format --html 2>/dev/null \
  | xmlstarlet select --template --value-of '//html/body/div/div[10]/div[2]/div[1]/div[2]/a[last()]/@href' -n

Output:

http://na-update.cedock.com/apps/resource2/V8R851T02/V8-R851T02-LF1V351/FOTA-OTA/V8-R851T02-LF1V351.015103.zip

I used `xmlstarlet format --html` to salvage the recoverable parts of the broken HTML.

Update

To get the last URL with the domain na-update.cedock.com:

curl -k -s 'https://4pda.ru/forum/index.php?showtopic=973246&st=4040#Spoil-97613600-1' \
  | xmlstarlet format --html 2>/dev/null \
  | xmlstarlet select --template --value-of '(//a[contains(@href,"http://na-update.cedock.com")])[last()]/@href' -n

The parentheses matter: `(//a[...])[last()]` applies `last()` to the whole set of matching anchors, whereas `//a[last()]` only considers anchors that are last among their siblings.
Cyrus
  • Thank you, it worked before. Now a new link has been added to that thread and the script doesn't work. The output is empty instead of being the new last link: http://na-update.cedock.com/apps/resource2/V8R851T02/V8-R851T02-LF1V356/FOTA-OTA/V8-R851T02-LF1V356.015216.zip And I am not able to find the error. – FaserF Jul 27 '20 at 09:38
  • Replace `//html/body/div/div[10]/div[2]/div[1]/div[2]/a[last()]/@href` with `//html/body/div/div[8]/div[2]/div[1]/div[2]/a[last()]/@href`. HTML pages are a moving target. – Cyrus Jul 27 '20 at 10:30
  • I've updated my answer with an alternative solution. – Cyrus Jul 27 '20 at 14:31