How can I get a list from an HTML file with only countries in it using grep and sed?

Question

I downloaded this site https://en.wikipedia.org/wiki/List_of_sovereign_states and I want to extract a list with only countries in it.

I downloaded the whole html in a file named countries.

curl https://en.wikipedia.org/wiki/List_of_sovereign_states >countries

I found that all the countries come after a span id = ...., so I tried to search for those using grep -F span id countries

But how can I filter the results with sed?

My problem is that I do not really understand how grep and sed work together. The manual pages are not that good for a beginner, and the internet is really not that helpful. I hope you can help me.

It depends on whether it is a one-off thing or something you will do regularly. If you do it regularly, then I agree: use a proper HTML parser, because the HTML format is not fixed. However good your sed skills, six months down the line someone will make an edit and your sed script will break in unusual ways. If it is a one-off, just load the file into your favourite visual editor or spreadsheet, break it into lines by splitting on a convenient character (e.g. >), then just do search/replace on the file until you get what you want. — Gem Taylor, May 13 '19 at 14:39
Google xmlstarlet. For other applications - if you're considering using grep+sed then you should be using just awk instead. — Ed Morton, May 13 '19 at 15:02
Possible duplicate of [How to get plain text out of wikipedia](https://stackoverflow.com/questions/4452102/how-to-get-plain-text-out-of-wikipedia) — kvantour, May 13 '19 at 20:56
With grep and sed, like asked for: grep -Po '' countries | sed -nr 's//\1/p' Use | (pipe) to redirect output from one command to another. Some countries missing from command output, because they are not in . — Panta, May 14 '19 at 05:32
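To illustrate the pipe mentioned in the last comment, here is a minimal sketch run on a made-up HTML fragment. The `sample_html` variable, its contents, and the `extract_names` helper are hypothetical and do not reflect the real Wikipedia markup. The idea is that grep keeps only the lines containing a pattern, and sed then rewrites each of those lines.

```shell
#!/bin/sh
# Hypothetical three-line fragment; the real page's markup differs.
sample_html='<p>Intro text</p>
<span id="Albania"><a href="/wiki/Albania">Albania</a></span>
<span id="Brazil"><a href="/wiki/Brazil">Brazil</a></span>'

# grep -F treats the pattern as a fixed string and prints matching lines.
# The pipe (|) feeds those lines to sed, which captures the link text
# in \1 and, because of -n plus the p flag, prints only the lines where
# the substitution succeeded.
extract_names() {
    grep -F '<span id=' | sed -n 's/.*<a [^>]*>\([^<]*\)<\/a>.*/\1/p'
}

printf '%s\n' "$sample_html" | extract_names
```

On this fragment the pipeline prints Albania and Brazil, one per line. Note that the pattern given to grep is quoted, so the shell passes it as a single argument.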

ceving · Answer 1 · 2019-05-14T09:40:03.113

1

Do not use grep or sed to parse XML or HTML. If you really want to, use a regular expression tester like regex101. But before you do so, read this first.

Try this:

xmllint --shell <<<'cat //tr/td[1]/descendant::span[@class="flagicon"]/following-sibling::a[@title]/text()' --html countries 2>/dev/null |
recode html..utf8 |
sort -u |
sed '/^[ /]/d'

edited May 14 '19 at 09:40

answered May 13 '19 at 15:03

ceving

Not all countries are output by that, e.g. Scotland and Greenland are missing since they appear in a different section of the table presumably, but it's a great start! – Ed Morton May 13 '19 at 15:33
@EdMorton Scotland is not independent, it is part of the UK. I took just the first column with `td[1]`. – ceving May 14 '19 at 07:22
Being Scottish myself, I'm aware of my country's status as stated in that article: `The United Kingdom is a Commonwealth realm[e] consisting of four constituent countries: England, Northern Ireland, Scotland, and Wales.`. Yes, I understand that's what you did, I understand why you did it, and like I said it's a great start. – Ed Morton May 14 '19 at 11:54

score 0 · Answer 2 · answered May 13 '19 at 17:57

0

This might work for you (GNU sed):

sed -nE 's/<td style="vertical-align:top;">.*title[^"]*"([^"(]*)( \([^)]*\))*".*/\1/p' countriesFile

This solution represents the 206 listed states in the table.
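To see what the expression does, it can be tried on a few made-up rows. This is only a sketch: `sample_rows` and the `country_names` helper are hypothetical, and the real table cells contain more markup than shown here. The first capture group takes the title text up to a quote or an opening parenthesis, and the optional second group swallows a trailing qualifier such as " (country)".

```shell
#!/bin/sh
# Hypothetical rows; the real table cells contain more markup.
sample_rows='<td style="vertical-align:top;"><a href="/wiki/Albania" title="Albania">Albania</a></td>
<td style="vertical-align:top;"><a href="/wiki/Georgia_(country)" title="Georgia (country)">Georgia</a></td>
<td>Other cell, <a href="/wiki/X" title="Ignored">X</a></td>'

# Same expression as in the answer, reading from standard input.
# -n suppresses default output; the p flag prints only substituted lines,
# so rows without the expected <td style=...> prefix are dropped.
country_names() {
    sed -nE 's/<td style="vertical-align:top;">.*title[^"]*"([^"(]*)( \([^)]*\))*".*/\1/p'
}

printf '%s\n' "$sample_rows" | country_names
```

On these rows the command prints Albania and Georgia; the third row is skipped because it lacks the style attribute the expression anchors on.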

potong
