
I have already read many pages here on Stack Overflow, but nothing works for my scenario.

I want to get the last matching URL (or all matching URLs) containing "cedock" from this website: "https://4pda.ru/forum/index.php?showtopic=973246&st=4040#Spoil-97613600-1"

When I save the file and then search it in my text editor, it works fine, but none of these commands worked for me to extract the URLs or filter anything from this page:

curl -k -s "https://4pda.ru/forum/index.php?showtopic=973246&st=4040#Spoil-97613600-1" | awk -F'SRC="|"' '/SRC/ && /'"cedock"'/  {print $4}'

curl -k -s "https://4pda.ru/forum/index.php?showtopic=973246&st=4040#Spoil-97613600-1" | grep -o 'FOTA-OTA/V8-R851T02-LF1V342.014883.zip.*zip</a><br /></div></div><br'

grep "<a href=" 4pda.txt |sed "s/<a href/\\n<a href/g" |sed 's/\"/\"><\/a>\n/2' |grep href |sort |uniq

Is something broken with the website itself? I am using similar commands on other websites and they work there.

The desired output is the latest download URL from cedock, so for example right now: http://na-update.cedock.com/apps/resource2/V8R851T02/V8-R851T02-LF1V351/FOTA-OTA/V8-R851T02-LF1V351.015103.zip
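For reference, a plain-grep approach can pull such links from a saved copy of the page without an HTML parser. This is a minimal sketch; the sample HTML below is a made-up stand-in for the real markup (the real saved file would be something like 4pda.txt):

```shell
# Stand-in sample for the saved page; the real page has the same href shape.
cat > sample.html <<'EOF'
<div><a href="http://na-update.cedock.com/apps/resource2/V8R851T02/V8-R851T02-LF1V342/FOTA-OTA/V8-R851T02-LF1V342.014883.zip">old</a><br />
<a href="http://na-update.cedock.com/apps/resource2/V8R851T02/V8-R851T02-LF1V351/FOTA-OTA/V8-R851T02-LF1V351.015103.zip">new</a></div>
EOF

# Extract every cedock .zip URL ([^"]* keeps the match inside the href
# attribute value), then keep only the last one.
grep -Eo 'http://[^"]*cedock[^"]*\.zip' sample.html | tail -n 1
```

This prints the second (latest) URL from the sample. It is brittle against markup changes, which is why an HTML-aware tool is still preferable.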

FaserF
    [Don't Parse XML/HTML With Regex.](https://stackoverflow.com/a/1732454/3776858) I suggest to use an XML/HTML parser (xmlstarlet, xmllint ...). – Cyrus Jul 11 '20 at 09:17
  • 3
    C'mon, investigate! Try the curl command, and look at the output. Or check the output for your keyword: `curl -s https://.... | grep cedock`. You'll get nothing. So, think about why that could be? What is different between curl and your browser? Javascript! Page content must be generated by JS in the browser, that does not happen using curl. – Don't Panic Jul 11 '20 at 09:18
  • 1
    @Don'tPanic that should be posted as an answer – alecxs Jul 11 '20 at 09:24
  • @Cyrus xmlstarlet can only read XML, and the output of this page is not XML..... – Luuk Jul 11 '20 at 09:39
  • @Don'tPanic thanks for the info, indeed I hadn't thought about JS with curl/wget. But the curl output saved as a text file does contain the needed URL lines, so JS doesn't seem to be the issue with my problem here, and so not the solution? – FaserF Jul 11 '20 at 14:28
  • I guess we are getting different results then. As I wrote, I see nothing matching `cedock`. – Don't Panic Jul 12 '20 at 00:42

1 Answer


With xmlstarlet:

curl -k -s 'https://4pda.ru/forum/index.php?showtopic=973246&st=4040#Spoil-97613600-1' \
  | xmlstarlet format --html 2>/dev/null \
  | xmlstarlet select --template --value-of '//html/body/div/div[10]/div[2]/div[1]/div[2]/a[last()]/@href' -n

Output:

http://na-update.cedock.com/apps/resource2/V8R851T02/V8-R851T02-LF1V351/FOTA-OTA/V8-R851T02-LF1V351.015103.zip

I used `xmlstarlet format --html` to salvage the recoverable parts of the broken HTML.

Update

To get the last URL with the domain na-update.cedock.com:

curl -k -s 'https://4pda.ru/forum/index.php?showtopic=973246&st=4040#Spoil-97613600-1' \
  | xmlstarlet format --html 2>/dev/null \
  | xmlstarlet select --template --value-of '(//a[contains(@href,"http://na-update.cedock.com")])[last()]/@href' -n

The parentheses matter: `(//a[...])[last()]` applies `last()` to the whole set of matching anchors, whereas `//a[last()]` only considers anchors that are last among their siblings.
Cyrus
  • Thank you, it worked before. Now a new link has been added to that thread and the script doesn't work. The output is empty instead of being the new last link: http://na-update.cedock.com/apps/resource2/V8R851T02/V8-R851T02-LF1V356/FOTA-OTA/V8-R851T02-LF1V356.015216.zip And I am not able to find the error. – FaserF Jul 27 '20 at 09:38
  • Replace `//html/body/div/div[10]/div[2]/div[1]/div[2]/a[last()]/@href` with `//html/body/div/div[8]/div[2]/div[1]/div[2]/a[last()]/@href`. HTML pages are a moving target. – Cyrus Jul 27 '20 at 10:30
  • I've updated my answer with an alternative solution. – Cyrus Jul 27 '20 at 14:31