-1

I downloaded this site https://en.wikipedia.org/wiki/List_of_sovereign_states and I want to extract a list with only countries in it.

I downloaded the whole html in a file named countries.

curl https://en.wikipedia.org/wiki/List_of_sovereign_states >countries

I found that all the countries come after a span id = ...., so I tried to search for those using grep -F span id countries

But how can I filter the results with sed?

My problem is that I do not really understand how grep and sed work together. The manual pages are not that good for a beginner, and the internet is really not that helpful. I hope you can help me.

Benjamin W.
Nico
  • You need an HTML parser. – SLaks May 13 '19 at 14:24
  • Depends whether it is a one-off thing or something you will do regularly. If you do it regularly, then I agree, use a proper HTML parser, because the HTML format is not fixed. However good your sed skills, six months down the line someone will make an edit and your sed script will break in unusual ways. If it is a one-off, just load the file into your favourite visual editor or spreadsheet, break it into lines by splitting on a convenient character (e.g. >), then just do search/replace on the file until you get what you want. – Gem Taylor May 13 '19 at 14:39
  • Google xmlstarlet. For other applications - if you're considering using grep+sed then you should be using just awk instead. – Ed Morton May 13 '19 at 15:02
  • Possible duplicate of [How to get plain text out of wikipedia](https://stackoverflow.com/questions/4452102/how-to-get-plain-text-out-of-wikipedia) – kvantour May 13 '19 at 20:56
  • With grep and sed, as asked for: grep -Po '' countries | sed -nr 's//\1/p' Use | (pipe) to redirect output from one command to another. Some countries are missing from the command output because they are not in . – Panta May 14 '19 at 05:32
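Since the question is about how grep and sed work together, here is a minimal sketch of the pipe idea on a tiny, made-up HTML sample. The markup, the `sample_html` and `names` variables, and the `title` attribute are assumptions for illustration only; the real Wikipedia page is structured differently and, as the comments point out, an HTML parser is the safer tool.

```shell
# Made-up sample markup (NOT the actual Wikipedia markup).
sample_html='<h1>Sample list</h1>
<span id="Albania"></span><a href="/wiki/Albania" title="Albania">Albania</a>
<p>Some unrelated paragraph.</p>
<span id="Belgium"></span><a href="/wiki/Belgium" title="Belgium">Belgium</a>'

# Step 1: grep -F keeps only the lines that contain the fixed string <span id
# Step 2: the pipe (|) hands those lines to sed, which replaces each whole
#         line with the value of its title attribute and prints it.
names=$(printf '%s\n' "$sample_html" |
  grep -F '<span id' |
  sed -n 's/.*title="\([^"]*\)".*/\1/p')

printf '%s\n' "$names"
```

grep acts as the line filter and sed as the line editor. In a case like this sed could do both jobs on its own, for example with `sed -n '/<span id/s/.*title="\([^"]*\)".*/\1/p'`.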

2 Answers

1

Do not use grep or sed to parse XML or HTML. If you really want to, use a regular expression tester like regex101. But before you do so, read this first.

Try this:

xmllint --shell <<<'cat //tr/td[1]/descendant::span[@class="flagicon"]/following-sibling::a[@title]/text()' --html countries 2>/dev/null |
recode html..utf8 |
sort -u |
sed '/^[ /]/d'
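To see what the last two stages of that pipeline do, here is a small sketch that runs them on hand-written text. The contents of `raw_output` are only an approximation of what `xmllint --shell` prints (prompt lines starting with `/` and separator lines starting with a space), and the variable names are made up. The `recode html..utf8` step, which converts HTML entities to UTF-8, is left out so the sketch runs without extra tools.

```shell
# Approximate stand-in for the xmllint shell output (an assumption, not a capture).
raw_output='/ >  -------
Albania
 -------
Afghanistan
 -------
Albania
/ > '

# sort -u sorts the lines and drops exact duplicates.
# The sed expression deletes every line that starts with a space or a slash,
# which removes the prompt and separator lines and leaves only the names.
cleaned=$(printf '%s\n' "$raw_output" |
  sort -u |
  sed '/^[ /]/d')

printf '%s\n' "$cleaned"
```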
ceving
  • Not all countries are output by that, e.g. Scotland and Greenland are missing since they appear in a different section of the table presumably, but it's a great start! – Ed Morton May 13 '19 at 15:33
  • @EdMorton Scotland is not independent, it is part of the UK. I took just the first column with `td[1]`. – ceving May 14 '19 at 07:22
  • Being Scottish myself, I'm aware of my country's status as stated in that article: `The United Kingdom is a Commonwealth realm[e] consisting of four constituent countries: England, Northern Ireland, Scotland, and Wales.`. Yes, I understand that's what you did, I understand why you did it, and like I said it's a great start. – Ed Morton May 14 '19 at 11:54
0

This might work for you (GNU sed):

sed -nE 's/<td style="vertical-align:top;">.*title[^"]*"([^"(]*)( \([^)]*\))*".*/\1/p' countriesFile

This prints the 206 states listed in the table.
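As a rough illustration of how that expression behaves, here is the same sed command run on three hypothetical table rows. The rows and the `sample_rows` and `extracted` variables are made up for this sketch and only loosely modeled on the page.

```shell
# Hypothetical rows: two matching cells and one cell that should be skipped.
sample_rows='<td style="vertical-align:top;"><a href="/wiki/Afghanistan" title="Afghanistan">Afghanistan</a></td>
<td style="vertical-align:top;"><a href="/wiki/Georgia_(country)" title="Georgia (country)">Georgia</a></td>
<td>Not a country cell</td>'

# -n suppresses automatic printing; the trailing p prints only lines where the
# substitution succeeded. Group 1 captures the title up to a quote or an
# opening parenthesis, so a suffix such as " (country)" is left out.
extracted=$(printf '%s\n' "$sample_rows" |
  sed -nE 's/<td style="vertical-align:top;">.*title[^"]*"([^"(]*)( \([^)]*\))*".*/\1/p')

printf '%s\n' "$extracted"
```

Because `.*` is greedy, a line with several `title` attributes yields the last one that fits the pattern, so this approach depends on the page keeping its current layout.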

potong