-1

Original question

My initial attempt was to run curl https://stackoverflow.com/users/5825294/enlico and pipe the result into sed/awk. However, as I've frequently read, sed and awk are not the best tools to parse HTML code. Furthermore, the above URL changes if I change my user name.

Oh, this is my quick attempt with sed, written on multiple lines for readability:

curl https://stackoverflow.com/users/5825294/enlico 2> /dev/null | sed -nE '
/title="reputation"/,/bronze badges/{
    /"reputation"/{
        N
        N
        s!.*>(.*)</.*!\1!p
    }
    /badges/s/.*[^1-9]([1-9]+[0-9]*,*[0-9]* (gold|silver|bronze) badges).*/\1/p
}'

which prints

10,968
5 gold badges
27 silver badges
56 bronze badges

Obviously this script heavily relies on the peculiar structure of the specific HTML page, the most notable example being that I run N twice because I've verified that the reputation is two lines below the first line in the file containing "reputation".

Update based on the answers

Léa Gris' answer almost answers my question. The missing bit is that I have 5 gold, 27 silver, and 56 bronze badges, not 5, 18, 7.

In this respect, I've noticed that 18 is the number of silver badges I have if I don't count those awarded multiple times. I therefore played around with jq and discovered that I can query the award_count alongside the rank, and I thought I could use that to take multiply awarded badges into account. This kind of works, in the sense that running the following (fetch_user_badges is from Léa Gris' answer) generates the correct number of silver badges but the wrong number of bronze badges:

$ fetch_user_badges stackoverflow 5825294 | jq -r '
.items
| map({rank: .rank, count: .award_count})
| group_by(.rank)
| map([[.[0].rank],map(.count) | add])'
[
  [
    "bronze",
    22
  ],
  [
    "gold",
    5
  ],
  [
    "silver",
    27
  ]
]

Does anybody know why that is?

Enlico
  • You can pass the user name as parameter or via the environment. For the parsing, show your attempt at how to do it, so that we have some concrete code to discuss. – user1934428 Apr 07 '21 at 11:13
  • See https://api.stackexchange.com/ – Léa Gris Apr 07 '21 at 11:59
  • @user1934428, here's the concrete code. – Enlico Apr 07 '21 at 12:47
  • @enlico : I see that you already have a good answer on it and an excellent link provided by Léa Gris. You should be settled by now.... – user1934428 Apr 07 '21 at 13:09
  • _"wrong number of bronze badges"_ - Pagination of the API result (`{"has_more": true}`). Why bother with such a cumbersome method when you can easily parse the html-source of your profile-page? – Reino Apr 12 '21 at 16:40
  • @Reino, are you suggesting to go back to my original approach? – Enlico Apr 12 '21 at 16:42
  • Have you seen Jack Fleeting's answer? An XPath one-liner to parse a website... a no-brainer, if you ask me. – Reino Apr 12 '21 at 17:19
  • @Reino, you could post another answer too, no? – Enlico Apr 12 '21 at 17:36
  • @Reino I was using the wrong API method. APIs are made precisely for this use. Suggesting to parse HTML is a very bad suggestion. HTML is unreliable to parse, and even if the HTML is strictly conformant, the location and hierarchy of the content you want to parse may change at any time. The API result is stable, predictable and documented. – Léa Gris Apr 13 '21 at 23:18
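
Regarding the pagination mentioned in the comments: a minimal sketch of summing award_count per rank over every page of the result, not just the first one. It assumes curl and jq are installed and that the /users/{id}/badges method is queried with its page and pagesize parameters; fetch_badge_rows and sum_by_rank are hypothetical names introduced here, not taken from any answer.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical; requires curl and jq for fetching.

STACK_API='https://api.stackexchange.com/2.2'

# Print one "rank<TAB>award_count" line per badge, following has_more.
# $1: site (example: stackoverflow)
# $2: numerical user ID
fetch_badge_rows() {
  local site="$1" uid="$2" page=1 reply
  while :; do
    # The API gzips its replies; --compressed lets curl decode them
    reply=$(curl --silent --compressed \
      "${STACK_API}/users/${uid}/badges?site=${site}&pagesize=100&page=${page}") || return
    printf '%s\n' "$reply" | jq -r '.items[] | [.rank, .award_count] | @tsv' || return
    # Stop when the API reports that no further page exists
    [ "$(printf '%s\n' "$reply" | jq -r '.has_more')" = true ] || break
    page=$((page + 1))
  done
}

# Read "rank<TAB>count" lines on stdin and print one total per rank.
sum_by_rank() {
  awk -F'\t' '{ total[$1] += $2 }
    END { for (r in total) print total[r], r, "badges" }' |
    sort -k2
}
```

It would be run as `fetch_badge_rows stackoverflow 5825294 | sum_by_rank`. Keeping the summing separate from the fetching means the totals can be checked against hand-written input without touching the network.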

4 Answers

2

Full example using StackExchange API and jq for parsing the response.

#!/usr/bin/env bash

# This script fetches and prints some user info
# from a stack-site using the stackexchange's API

# Change this to your numerical Stack Overflow user ID
STACK_UID=5825294
STACK_SITE='stackoverflow'
STACK_API='https://api.stackexchange.com/2.2'

API_CACHE=~/.cache/stack_api

mkdir -p "$API_CACHE"

# Get a stack-site user through the Stack Exchange API and cache the result
# @Params:
# $1: the website (example: stackoverflow)
# $2: the numerical user ID
# @Output:
# &1: API JSON reply
stack_api::user() {
  stack_site=$1
  stack_uid=$2

  cache_file="${API_CACHE}/${stack_site}-users-${stack_uid}.json"

  yesterday_ref="${API_CACHE}/yesterday.ref"
  touch -d yesterday "$yesterday_ref"

  # Expire cache
  [ "$cache_file" -ot "$yesterday_ref" ] && rm -f -- "$cache_file"

  # Call stack API only if no cached answer
  [ -f "$cache_file" ] || curl \
    --silent \
    --output "$cache_file" \
    --request GET \
    --url "${STACK_API}/users/${stack_uid}?site=${stack_site}"

  # Return cached answer
  zcat --force -- "$cache_file" 2>/dev/null
}

IFS=$'\n' read -r -d '' username reputation bronze silver gold < <(
  # Fetch user from a stack site
  stack_api::user "$STACK_SITE" "$STACK_UID" |

  # Parse the stack_api user data from the JSON response
  jq -r '
.items[0] |
  .display_name,
  .reputation,
  ( .badge_counts |
    .bronze,
    .silver,
    .gold
  )
  '
)

printf 'Badges from UserID %d %s on the %s website:\n\n' \
  "$STACK_UID" "$username" "$STACK_SITE"
printf 'Reputation: %6d\n' "$reputation"
printf 'Bronze:     %6d\n' "$bronze"
printf 'Silver:     %6d\n' "$silver"
printf 'Gold:       %6d\n' "$gold"

Example output:

Badges from UserID 5825294 Enlico on the stackoverflow website:

Reputation:  11144
Bronze:         56
Silver:         27
Gold:            5
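
The cache-expiry step above uses touch -d yesterday, which is a GNU extension. A hedged sketch of the same freshness check done with find -mmin instead, assuming a find that supports -mmin (GNU and BSD find do, but POSIX does not require it). cache_is_fresh is a hypothetical helper name, not part of the script above.

```shell
#!/usr/bin/env bash
# Sketch only: cache_is_fresh is a hypothetical helper.

# Succeed when the file exists and was modified less than N minutes ago.
# $1: path of the cache file
# $2: maximum age in minutes (defaults to 1440, i.e. 24 hours)
cache_is_fresh() {
  local file="$1" max_age_min="${2:-1440}"
  # find prints the path only when the modification time is recent enough
  [ -f "$file" ] && [ -n "$(find "$file" -mmin "-${max_age_min}" 2>/dev/null)" ]
}
```

Inside stack_api::user it could stand in for the two-step expire-then-test logic, for example `cache_is_fresh "$cache_file" || curl ...`, so that a stale file is simply overwritten by the new download.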
Léa Gris
  • I see that the numbers of silver and bronze badges are different from what I see. I have verified that the silver badges are counted without duplicates (and that I could retrieve the `.award_count` for each of them and sum them up if I want to get the number stackoverflow shows); but do you know why the bronze badges are 7? I haven't 49 duplicate bronze badges... – Enlico Apr 11 '21 at 07:22
  • Oh, as regards the reputation, I see I can do `jq -r '.items | .[0].user.reputation'` to get it. How do you suggest that I integrate this within your script? – Enlico Apr 11 '21 at 07:50
  • @Enlico Fixed the number of badges, was using the wrong API method – Léa Gris Apr 13 '21 at 17:43
2

as I've frequently read, sed and awk are not the best tools to parse HTML code.

That's right. Instead of repeating what others have already said, I'd say: have a look at:

Too bad that last website is rather outdated, because to parse an HTML-source I would pick the Swiss Army knife tool anytime!

HTML-source

$ xidel -s "https://stackoverflow.com/users/5825294" -e '
  normalize-space(//div[@class="flex--item md:fl-auto"][1]),
  //div[@class="d-flex ai-center mb12"]/normalize-space(div[@class="flex--item fl1"])
'
14,999 reputation
5 gold badges
31 silver badges
68 bronze badges

Furthermore, the above URL changes if I change my user name.

As you can see, "https://stackoverflow.com/users/5825294" works too.
With curl, -L (--location) would be needed to follow the redirect to "https://stackoverflow.com/users/5825294/enlico". xidel does this automatically.

StackExchange API

The same Swiss Army knife tool is also a JSON parser:

$ xidel -s "https://api.stackexchange.com/2.2/users/5825294?site=stackoverflow" -e '
  $json/(items)()/(
    reputation||" reputation",
    for $x in reverse((badge_counts)()) return
    join(((badge_counts)($x),$x,"badges"))
  )
'
14999 reputation
5 gold badges
31 silver badges
68 bronze badges

Also see this Xidel online tester for (alternative) intermediate steps.

Reino
1

There are a few ways of doing that; I personally prefer using XPath with a tool like xidel (although you can also use xmlstarlet, etc.)

You can get your reputation score using

xidel https://stackoverflow.com/users/5825294/enlico  -e "//div[@title='reputation']/div/div[@class='grid--cell fs-title fc-dark']/text()"

Similarly, the number of gold medals is obtained using:

xidel https://stackoverflow.com/users/5825294/enlico  -e "//div[@class='grid ai-center s-badge s-badge__gold']//span[@class='grid grid__center fl1']/text()"

Changing the string gold to silver or bronze in that second xpath expression will get you the other two categories.
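
To avoid editing the expression by hand, the substitution can be wrapped in a loop. This is only a sketch: badge_xpath and print_badges are hypothetical names, the class names are copied from the expression above and may no longer match the live page, and xidel plus network access are assumed.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical; the class names come from the
# XPath expression quoted in this answer.

# Build the badge XPath for one rank.
# $1: gold, silver or bronze
badge_xpath() {
  printf "//div[@class='grid ai-center s-badge s-badge__%s']//span[@class='grid grid__center fl1']/text()" "$1"
}

# Print "<count> <rank>" for each of the three ranks.
# $1: profile URL
print_badges() {
  local url="$1" rank
  for rank in gold silver bronze; do
    printf '%s %s\n' "$(xidel -s "$url" -e "$(badge_xpath "$rank")")" "$rank"
  done
}
```

It would be called as `print_badges https://stackoverflow.com/users/5825294/enlico`. Note that this fetches the page once per rank, which is fine for a one-off but wasteful if run often.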

Jack Fleeting
  • How about `-e '(//span[@class="profile-communities--rep-badges"])[1]//@title'`? – Reino Apr 07 '21 at 21:08
  • @Reino As I said in the answer, there are several ways of getting there. Extracting the attribute value of the `title` attribute is another one. However, generally I wouldn't rely on position selectors (i.e., `[1]`, in your comment), even if they work, as they tend to be brittle (as in, the position is more likely to change over time than the node names or their attribute and attribute values). – Jack Fleeting Apr 07 '21 at 22:04
  • Fair enough. In that case `-e '//a[@title="Stack Overflow"]/span[@class="profile-communities--rep-badges"]//@title'` would do as a one-liner. No need for 3 queries for the 3 different badges. – Reino Apr 07 '21 at 22:54
  • @Reino Yes and no; it depends on what exactly you are trying to target; if you want to get the 3 badges in one "pile" - that expression works. However, if you are looking to specifically target just `27` as the number of silver badges (for example, to see if it's higher or lower than someone else's silver badges) you will need to use the more targeted approach of one of the other expressions. Again, there are a few ways to skin the cat - depending on what type of cat you are dealing with... – Jack Fleeting Apr 07 '21 at 23:32
  • Nowhere in OP's post does he state the requirement of being able to compare the output to someone else's number of badges. I understand OP wants his reputation + number of badges in one "pile" (which his `sed` workaround is showing) and my one-liner does exactly that. Guess I should've posted my own answer. – Reino Apr 08 '21 at 00:30
  • @Reino You can still post your own answer. I don't know about OP, but I for one will be happy to upvote it since it's a valid approach. – Jack Fleeting Apr 08 '21 at 11:32
0

The age-old wisdom is: do not parse HTML with regex. How about

curl https://stackoverflow.com/users/5825294/enlico -s | php -r '$d=new DOMDocument();@$d->loadHTML(stream_get_contents(STDIN));$xp=new DOMXPath($d);foreach($xp->query("//*[@id=\"user-card\"]//*[contains(@title,\"badges\")]") as $foo){echo $foo->getAttribute("title"),PHP_EOL;}echo preg_replace("/\\s+/"," ",$xp->query("//*[@title=\"reputation\"]")->item(0)->textContent);'

5 gold badges
27 silver badges
56 bronze badges
 11,144 reputation

...

hanshenrik