
I want to create a script that takes words/lines from a web page and writes them to a text file. The goal is to save the version number and update date of apps on Google Play. I tried several solutions, but didn't get there.

First attempt:

content=$(wget link -q -O -)
echo $content >> $HOME/Desktop/App\ Version.txt

Problem: Here only the HTML source code was written to a file in a single line.

Second attempt:

(Here I found out that the version number of all apps is in the <div class="reAt0">X.X</div> of the HTML source code. The date is in <div class="xg1aie">XX.XX.XX</div>.)

wget link -O $HOME/Desktop/App\ Version.html
tag=div class="reAt0"
tag2=div
sed -n "/<$tag>/,/<\/$tag2>/p" $HOME/Desktop/App\ Version.html

Problem: This gave the best result so far as the HTML source code is concerned, but the entire HTML source code is printed to the terminal. And when I read the HTML file, both div classes had become the following: [[["version number"]].

Example:

<div class="reAt0">1.0</div>

becomes

[[["1.0"]]

Third attempt:

curl -o $HOME/Desktop/App\ Version.txt Link
cat $HOME/Desktop/App\ Version.txt | grep "<xg1aie>" | sed 's/<[^>]*>//g'

Problem: cat does not work because of the div problem as previously reported.

Unfortunately, I am not so familiar with scripts. The goal should be a script that writes the date and version number of several apps to a single text file.

Lauren Yim
  • Please mention the url and the very thing you're trying to scrape from it. And please DO NOT use `sed` to parse HTML. It's not designed for that. Use a proper HTML-parser instead! – Reino Aug 27 '22 at 10:44
  • I want to scrape the version number and update date of the Steam app. The version and date are in the "About this app" dialog. The current version of the Steam app is 2.3.13, and it was updated on Jun 1, 2021. Here is the link: https://play.google.com/store/apps/details?id=com.valvesoftware.android.steam.community – Monstanner Aug 27 '22 at 14:54

2 Answers


Can you use a tool that was intended for parsing HTML? Something like xmllint may be suitable for running an XPath query against your HTML. Note that the XPath query has to be based on the HTML that you are actually parsing.

xmllint --html --xpath 'concat("version: ", //body/div[@class="reAt0"]/text(), " date: ", //body/div[@class="xg1aie"]/text())' - <<EOF
<html>
<body>
<div class="reAt0">X.X</div>
<div class="xg1aie">XX.XX.XX</div>
</body>
</html>
EOF

Output:

version: X.X date: XX.XX.XX

The above example uses a heredoc for a dummy html file, but will work the same if parsing an .html file.

xmllint --html --xpath 'concat("version: ", //body/div[@class="reAt0"]/text(), " date: ", //body/div[@class="xg1aie"]/text())' sample.html

Output:

version: X.X date: XX.XX.XX
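
Since the goal is one text file covering several apps, the same query can be wrapped in a loop. This is only a minimal sketch: the function names `app_info` and `collect_app_info` and the file names are made up for illustration, and it assumes each saved page really contains the two `<div>` classes from the question.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print "version: ... date: ..." for one saved HTML file,
# using the same XPath query as above. Parser warnings are discarded.
app_info() {
  xmllint --html --xpath \
    'concat("version: ", //body/div[@class="reAt0"]/text(), " date: ", //body/div[@class="xg1aie"]/text())' \
    "$1" 2>/dev/null
}

# Hypothetical helper: append one line per HTML file to the output file ($1).
collect_app_info() {
  local out=$1; shift
  local f
  for f in "$@"; do
    printf '%s: %s\n' "$f" "$(app_info "$f")" >> "$out"
  done
}
```

It could then be called as `collect_app_info "$HOME/Desktop/App Version.txt" app1.html app2.html`, which appends one line per file.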
j_b
  • First of all, thank you very much for the answer. Your example works as described and I understand it. Unfortunately, I made a mistake: I found the div with the inspect tool, so of course there is no output. I looked at the HTML source code again. My described "div problem" is nonsense, because the number is not in `
    ` but in `
    – Monstanner Aug 22 '22 at 10:09

I want to scrape the version number and update date of the Steam app. The version and date are in the "About this app" dialog. The current version of the Steam app is 2.3.13, and it was updated on Jun 1, 2021. Here is the link: play.google.com/store/apps/…

This is one terrible website to scrape, and a fragile endeavour at that. One small update to the website on their part and you can start all over again. Still, at the moment it is doable with a tool like xidel (an XML/HTML/JSON parser). With tools like sed (regex) I think this would be next to impossible. Please see 1732454, 590747 and 6751105 for examples of why it's a bad idea to parse HTML with regular expressions.

Okay. It looks like the date is the only thing that you can easily get from a <div>-node (I don't see the version-string in the "About this app"-dialog):

$ xidel -s "https://play.google.com/store/apps/details?id=com.valvesoftware.android.steam.community" \
  -e '//div[@class="xg1aie"]'
Jun 1, 2021

The other stuff (including the date again) can be found in a <script>-node where the text-node is a (rather complicated) "pseudo" JSON:

$ xidel -s "https://play.google.com/store/apps/details?id=com.valvesoftware.android.steam.community" \
  -e '//body/script[15]' \
  --output-node-format=xml

A lot of <script>-nodes have the same nonce-attribute, without any other identifiable attribute, so we just have to select the one we're after; the 15th.

After removing the JavaScript code, xidel, as a JSON parser, can parse the JSON:

$ xidel -s "https://play.google.com/store/apps/details?id=com.valvesoftware.android.steam.community" -e '
  parse-json(
    extract(//body/script[15],"AF_initDataCallback\((.+)\);",1),
    {"liberal":true()}
  )
'

Then you can grab the stuff you want from this JSON and do a string-concatenation, like for instance:

$ xidel -s "https://play.google.com/store/apps/details?id=com.valvesoftware.android.steam.community" -e '
  parse-json(
    extract(//body/script[15],"AF_initDataCallback\((.+)\);",1),
    {"liberal":true()}
  )/(data)(2)(3)/x"{.(1)()} {.(141)(1)()()} ({.(141)(3)()()})"
'
Steam 2.3.13 (Jun 1, 2021)

Here too it's rather fragile, having to select the 141st element of an array because there's no other identifiable way.

x"..." is xidel's own "extended string syntax". With XPath's concat() function that would be:

$ xidel -s "https://play.google.com/store/apps/details?id=com.valvesoftware.android.steam.community" -e '
  parse-json(
    extract(//body/script[15],"AF_initDataCallback\((.+)\);",1),
    {"liberal":true()}
  )/(data)(2)(3)/concat(
    .(1)()," ",.(141)(1)()(),
    " (",.(141)(3)()(),")"
  )
'
Steam 2.3.13 (Jun 1, 2021)
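
To collect several apps in a single text file, the concat query above could be wrapped in a small Bash loop. Treat this as a sketch under assumptions: the function names `play_url`, `app_summary` and `write_summaries` are made up for illustration, and it is an untested guess that the `script[15]` and `141` indices hold for apps other than Steam.

```shell
#!/usr/bin/env bash

# Hypothetical helper: build the Play Store details URL for a package id.
play_url() {
  printf 'https://play.google.com/store/apps/details?id=%s\n' "$1"
}

# Hypothetical helper: print "Name Version (Date)" for one package id,
# reusing the xidel query from above unchanged.
# Assumption: the script[15] and 141 indices are valid for this app too.
app_summary() {
  xidel -s "$(play_url "$1")" -e '
    parse-json(
      extract(//body/script[15],"AF_initDataCallback\((.+)\);",1),
      {"liberal":true()}
    )/(data)(2)(3)/concat(
      .(1)()," ",.(141)(1)()(),
      " (",.(141)(3)()(),")"
    )
  '
}

# Append one summary line per package id to the output file ($1);
# report ids for which the query failed on stderr.
write_summaries() {
  local out=$1; shift
  local id
  for id in "$@"; do
    app_summary "$id" >> "$out" || printf 'failed: %s\n' "$id" >&2
  done
}
```

A call such as `write_summaries "$HOME/Desktop/App Version.txt" com.valvesoftware.android.steam.community` would append one line per package id given. It is worth checking each line by eye, because a layout change on the website can silently shift the indices.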

As a bonus you can also grab the Epoch timestamp instead:

$ xidel -s "https://play.google.com/store/apps/details?id=com.valvesoftware.android.steam.community" -e '
  parse-json(
    extract(//body/script[15],"AF_initDataCallback\((.+)\);",1),
    {"liberal":true()}
  )/(data)(2)(3)/concat(
    .(1)()," ",.(141)(1)()(),
    " (",.(146)()(2)(1) * duration("PT1S") + dateTime("1970-01-01T00:00:00Z"),")"
  )
'
Steam 2.3.13 (2021-06-01T17:33:25Z)
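
The arithmetic in that query adds the epoch value, taken as seconds, to 1970-01-01T00:00:00Z. If you would rather store the raw number and convert it later in the shell, something like the following should be equivalent. It is a sketch that assumes GNU `date`; the function name `epoch_to_iso` is made up, and the value 1622568805 is not taken from the page but computed back from the output shown above.

```shell
#!/usr/bin/env bash

# Hypothetical helper: format epoch seconds as an ISO 8601 UTC timestamp.
# Assumes GNU date (the -d @SECONDS form); BSD/macOS date uses -r SECONDS instead.
epoch_to_iso() {
  date -u -d "@$1" +%Y-%m-%dT%H:%M:%SZ
}

# 1622568805 is back-computed from the timestamp in the output above.
epoch_to_iso 1622568805   # prints 2021-06-01T17:33:25Z
```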
Reino
  • That's it. Really very well explained for a newbie. Thumbs up for that. Thanks also for the information about `sed` and the attached explanations. – Monstanner Aug 27 '22 at 22:33