
I want to collect user names from member-list pages like this: http://www.marksdailyapple.com/forum/memberslist/

I want to get every username from all the pages, and I want to do this on Linux, with bash.

Where should I start? Could anyone give me some tips?

erical
    You have to try something yourself. Try `curl`. But better use a real HTML parser in a language like perl (WWW::Mechanize + HTML::TreeBuilder::Xpath), python or ruby. – Gilles Quénot Oct 26 '13 at 19:06
  • Thank you for your tips, but it seems too hard for me… – erical Oct 26 '13 at 19:27

4 Answers


This is what my Xidel was made for:

xidel http://www.marksdailyapple.com/forum/memberslist/ -e 'a.username'  -f '(//a[@rel="Next"])[1]'

With that single line it parses the pages with a proper HTML parser, uses CSS selectors to find all links with usernames, uses XPath to find the next page, and repeats until all pages are processed.

You can also write it using only css selectors:

xidel http://www.marksdailyapple.com/forum/memberslist/ -e 'a.username'  -f 'div#pagination_top span.prev_next a'

Or with pattern matching. There you basically just copy the HTML elements you want to find from the page source and replace the text content with {.}:

xidel http://www.marksdailyapple.com/forum/memberslist/ -e '<a class="username">{.}</a>*'  -f '<a rel="next">{.}</a>'
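
If you want the names in a file rather than printed to the console, the output can simply be redirected (usernames.txt is just a placeholder name):

xidel http://www.marksdailyapple.com/forum/memberslist/ -e 'a.username'  -f '(//a[@rel="Next"])[1]' > usernames.txt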
BeniBela
  • Oh! I like this, great job! – Patrik Martinsson Oct 26 '13 at 23:21
  • Nice tool! Can you explain what `a.username` and `(//a[@rel="Next"])[1]` mean, please? It doesn't seem to work for the page I want to extract data from. – erical Oct 27 '13 at 09:07
  • `a.username` is a css selector to select all links with class `username`. `(//a[@rel="Next"])[1]` is an XPath expression to select the first link with attribute `rel="Next"`. `-e` means it should extract the selection (print it to stdout), `-f` that it should follow a link. And you can use css selectors or xpath for both options; it should automatically detect which kind of expression you have used – BeniBela Oct 27 '13 at 10:35
  • Actually, I don't know much about HTML and CSS, so it's too hard for me to figure out the extract expression by myself, and I want to use that Xidel script. But after I select the 1st ID "#thernrem1950", the css expression displayed in the result box is a little different from yours, which is "TABLE#memberlist_table TR TD.username A.username". Can you tell me why, please? And I don't know how to get the "follow expression" by using Xidel… – erical Oct 27 '13 at 11:36
  • If a css selector contains spaces, it contains multiple selectors, each applying only to the descendants of the results of the previous one. So your selector first selects `TABLE#memberlist_table`, then `TR`, then `TD` and finally `A.username`, which is the same as mine. The follow expression selects the [next button](http://www.marksdailyapple.com/forum/images/pagination/next-right.png). You can also use css for that: `-f "div#pagination_top span.prev_next a"`. (Just `-f "span.prev_next a"` does not work, because Xidel fails to detect it as a css selector due to the `_`.) – BeniBela Oct 27 '13 at 12:54
  • Thank you for explaining with so much patience. I think I get how the extract expression works in this case now, but I'm still a little confused about the follow expression "div#pagination_top span.prev_next a": on member-list page 1 there is only one link (the "next" one) that matches that expression, but on page 2 there are 2 links ("previous" and "next"). How does it decide in that case? – erical Oct 28 '13 at 08:43
  • And can it output the result to a text file? – erical Oct 28 '13 at 10:25
  • It only visits every page once. Since it was already on the previous page, that is skipped. And you can just redirect the output with `> textfile` – BeniBela Oct 28 '13 at 10:46
  • Sorry to bother you again; I'm wondering if Xidel can be used to download articles, all articles, from someone's blog, like this? http://www.rafabenitez.com/web/in/blog/4/ – erical Nov 04 '13 at 08:54
  • Yes. E.g. `xidel http://www.rafabenitez.com/web/in/blog/4/ -f 'css("a.tituloEntrada") / resolve-uri(@href, //base/@href)' --download '{extract($url, "/([0-9]+)/$", 1)}.html'` (resolve-uri is needed, because the page uses a `base` element to make links relative to another url instead of its own url, and xidel does not handle that automatically. extract just saves the files with nicer names) – BeniBela Nov 09 '13 at 17:32

First you should use wget to get all the username pages. You will have to use some options (check the man page for wget) to make it follow the right links, and ideally not follow any of the uninteresting links (or failing that, you can just ignore the uninteresting links afterwards).

Then, despite the fact that Stackoverflow tells you not to use regular expressions to parse HTML, you should use regular expressions to parse HTML, because it's only a homework assignment, right?

If it's not a homework assignment, you've not chosen the best tool for the job.
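
For example, a rough, untested sketch of the download step (the recursion options and the index<N>.html pattern are assumptions based on how the pagination looks elsewhere in this thread, so do check the man page; --accept-regex needs wget >= 1.14):

# fetch the member-list index pages only
wget --recursive --level=1 --no-directories --no-parent \
     --accept-regex 'memberslist/index[0-9]+\.html' \
     http://www.marksdailyapple.com/forum/memberslist/

Extracting the usernames from the files it saves is touched on in the comments below.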

Robin Green
  • Thank you, and it's not a homework assignment; I just want to do it with bash. – erical Oct 26 '13 at 19:30
  • But why do you want to do it with bash?! – Robin Green Oct 26 '13 at 19:32
  • I'm learning Linux, so… (but I won't refuse to learn a better method ^^) And after I download the webpage using wget, what command can I use to get all the user IDs from that page? Is grep OK? Can you help me straighten out my thinking on this part? – erical Oct 26 '13 at 19:47
  • `grep` _might_ work for these specific pages, but I suggest that a real programming language and a real HTML parser would be better able to cope with different types of user list pages. But if you really want to use `grep`, you should view the source of the HTML page and design a regular expression to match the `a` elements containing the usernames (a sketch follows after these comments). Hint: they are helpfully given the CSS class `username`. – Robin Green Oct 26 '13 at 19:56
  • What language? What HTML parser? Can you recommend one, please? And what do you mean by the hint "they are helpfully given the CSS class username"? Can you use this page or another as an example? – erical Oct 26 '13 at 20:23
  • Do you know what a CSS class is? – Robin Green Oct 26 '13 at 20:25
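
A minimal grep sketch of that hint, assuming the page has already been downloaded and saved locally (memberslist.html is a placeholder file name), with all the usual caveats about regexes on HTML:

# print the text of every <a class="username">...</a> element
grep -o 'class="username">[^<]*' memberslist.html | sed 's/.*>//'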

As Robin suggests, you should really do this kind of stuff in a programming language with a decent HTML parser. You can always use command-line tools to do various tasks; however, in this case I would probably have chosen perl.

If you really want to try to do it with command-line tools, I would suggest curl, grep, sed and sort.

I always find it easier when I have something to play with, so here's something to get you started.
I would not use this kind of code to produce anything serious, though; it's just so you can get some ideas.

  • The member pages seem to be xxx://xxx.xxx/index1.html, where the 1 indicates the page number. Therefore the first thing I would do is extract the number of the last member page. Once I have that, I know which URLs I want to feed to curl.

  • Every username is in an element with the class "username"; with that information we can use grep to get the relevant data.

    #!/bin/bash
    number_of_pages=2   # placeholder; see note 3 below on finding the real count
    # curl expands the [1-N] range itself and fetches every index page in turn
    curl --silent "http://www.marksdailyapple.com/forum/memberslist/index[1-${number_of_pages}].html" \
        | egrep -o 'class="username">.*</a>' \
        | sed 's/.*>\(.*\)<\/a>/\1/' \
        | sort

The idea here is to give curl the addresses in the format index[1-XXXX].html; that makes curl traverse all the pages. We then grep for the username class and pass the matches to sed to extract the relevant data (the username). Finally, we pipe the resulting "username list" to sort to get the usernames sorted. I always like sorted things ;)

Big notes, though:

  1. You should really be doing this another way. Again, I recommend perl for this kind of task.
  2. There is no error checking, validation of usernames, etc. If you were to use this in some sort of production, there are no shortcuts; do it right. Try to read up on how to parse web pages in different programming languages.
  3. On purpose, I declared number_of_pages as two. You'll have to figure out a way by yourself to get the number of the last member page (a sketch of one possibility follows below). There were a lot of pages, though, and I imagine it would take some time to iterate through them.
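
A rough sketch of one way to get that number, assuming the first member-list page contains a "Last Page" link pointing at an indexN.html URL (the same trick the last answer on this page uses):

    # grab the highest page number from the "Last Page" link on the first page
    number_of_pages=$(curl --silent http://www.marksdailyapple.com/forum/memberslist/ \
        | sed -n '/Last Page/s/.*index\([0-9]\+\)\.html.*/\1/p' | head -1)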

Hope that helps!

  • Your advice is really helpful, thank you, but I know just as little about perl for now; maybe I'll try to learn it later – erical Oct 27 '13 at 12:42

I used this bash script to go through all the pages:

#!/bin/bash

# split command substitutions on newlines only
IFS=$'\n'
url="http://www.marksdailyapple.com/forum/memberslist/"
# fetch the first member-list page (col -b strips backspace/control characters)
content=$(curl --silent -L ${url} 2>/dev/null | col -b)
# read the number of the last page out of the "Last Page" link
pages=$(echo ${content} | sed -n '/Last Page/s/^.*index\([0-9]\+\).*/\1/p' | head -1)
for page in $(seq ${pages}); do
    IFS=
    content=$(curl --silent -L ${url}index${page}.html 2>/dev/null | col -b)
    # pull out the text of every element with class="username"
    patterns=$(echo ${content} | sed -n 's/^.*class="username">\([^<]*\)<.*$/\1/gp')
    # split the matches into an array, one username per entry
    IFS=$'\n' users=(${patterns})
    for user in ${users[@]}; do
        echo "user=${user}."
    done
done
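
Assuming the script is saved as, say, memberslist.sh (a placeholder name), its output can be stripped back to plain usernames and written to a file like this:

bash memberslist.sh | sed 's/^user=//; s/\.$//' > usernames.txt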
cforbish