
This may already have been answered; if so, please point me to the solution with a link.

What I have is a file which has details of the 100 largest countries by total area (land and water surface):

('1','Russia','17,098,242(6,601,668)','Asia/Europe','Azerbaijan, Belarus, China, Estonia, Finland, Georgia, Kazakhstan, Latvia, Lithuania, Mongolia, North Korea, Norway, Poland, Ukraine')
('2','Canada','9,984,670(3,855,100)','North America','United States')
('3','United States(incl. overseas territories)','9,857,348(3,805,943)','North America','Canada, Mexico')
('4','China','9,596,961(3,705,407)','Asia','Afghanistan, Bhutan, India, Kazakhstan, Kyrgyzstan, Laos, Mongolia, Myanmar, Nepal, North Korea, Pakistan, Russia, Tajikistan, Vietnam')
('5','Brazil','8,515,770(3,287,957)','South America','Argentina, Bolivia, Colombia, France (French Guiana), Guyana, Paraguay, Peru, Suriname, Uruguay, Venezuela'), 
....
....

And yes, each line of the input file begins with ( and ends with ).

Any help will be really appreciated.

So far, I was trying to get this by writing:

onlyCountries = 'allcountries.txt'
print([x.split(',')[1] for x in open(onlyCountries)])

But that gives me output as:

["'Russia'", "'Canada'", "'United States(incl. overseas territories)'", "'China'", "'Brazil'"...]

Notice the extra double quotes around each name, compared with the input file sample I gave above. I would like to get this output instead:

['Russia','Canada','United States','China','Brazil',....]
Mohammad Yusuf
  • read line-by-line, `split(',')` every line to get list of elements and then get second element - `[1]` - it will be country name which you can add to list of all countries. – furas Jan 02 '17 at 13:53
  • what is the format of the file entries ? is it how it is shown above with `(` and `)`? – Sarath Sadasivan Pillai Jan 02 '17 at 14:00
  • It looks like you want us to write some code for you. While many users are willing to produce code for a coder in distress, they usually only help when the poster has already tried to solve the problem on their own. A good way to demonstrate this effort is to include the code you've written so far, example input (if there is any), the expected output, and the output you actually get (output, tracebacks, etc.). The more detail you provide, the more answers you are likely to receive. Check the [FAQ](http://stackoverflow.com/tour) and [How to Ask](http://stackoverflow.com/questions/how-to-ask). – TigerhawkT3 Jan 02 '17 at 14:13
  • @TigerhawkT3, my apologies, I didn't mean for someone to write the code for me. I have updated the description with what I was trying. This is my first question; I will definitely keep the instructions in mind in future. Thank you for your kind reminder. – user7365492 Jan 02 '17 at 15:12
  • I think I got my answer here: http://stackoverflow.com/a/21626718/7365492 Thank you for all your responses. I will see how to close this question so that no one needs to answer it anymore. – user7365492 Jan 02 '17 at 15:14

2 Answers

You can get it with pandas like this:

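import pandas as pd

# read_html returns a list of DataFrames; the first one is the country table
df = pd.read_html("https://www.countries-ofthe-world.com/largest-countries.html", header=0, index_col=0)[0]
# Drop everything from the first "(" onward; regex=True is required on pandas 2.0+, where matching is literal by default
clist = df["Country"].str.replace(r"\(.*", "", regex=True).tolist()
print(clist)  # on Python 3 the strings print without the u'' prefix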

Output:

[u'Russia', u'Canada', u'United States ', u'China', u'Brazil', u'Australia ', u'India', u'Argentina', u'Kazakhstan', u'Algeria', u'Democratic Republic of the Congo', u'Denmark ', u'Saudi Arabia', u'Mexico', u'Indonesia', u'Sudan', u'Libya', u'Iran', u'Mongolia', u'Peru', u'Chad', u'Niger', u'Angola', u'Mali', u'South Africa', u'Colombia', u'Ethiopia', u'Bolivia', u'Mauritania', u'Egypt', u'Tanzania', u'Nigeria', u'Venezuela', u'Namibia', u'Mozambique', u'Pakistan', u'Turkey', u'Chile', u'Zambia', u'Myanmar', u'Afghanistan', u'France ', u'Somalia', u'Central African Republic', u'South Sudan', u'Ukraine', u'Madagascar', u'Botswana', u'Kenya', u'Yemen', u'Thailand', u'Spain', u'Turkmenistan', u'Cameroon', u'Papua New Guinea', u'Sweden', u'Uzbekistan', u'Morocco', u'Iraq', u'Paraguay', u'Zimbabwe', u'Japan', u'Germany', u'Republic of the Congo', u'Finland ', u'Vietnam', u'Malaysia', u'Norway ', u"Cote d'Ivoire", u'Poland', u'Oman', u'Italy', u'Philippines', u'Ecuador', u'Burkina Faso', u'New Zealand ', u'Gabon', u'United Kingdom ', u'Guinea', u'Uganda', u'Ghana', u'Romania', u'Laos', u'Guyana', u'Belarus', u'Kyrgyzstan', u'Senegal', u'Syria', u'Cambodia', u'Uruguay', u'Suriname', u'Tunisia', u'Nepal', u'Bangladesh', u'Tajikistan', u'Greece', u'Nicaragua', u'North Korea', u'Malawi', u'Eritrea']
Mohammad Yusuf
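countries = []
with open('text.txt', 'r') as f:
    for line in f.readlines():
        # The second comma-separated field is the quoted country name
        country = line.split(',')[1]
        # Drop the surrounding quotes and any "(...)" note after the name
        countries.append(country.strip("'").split('(')[0])
print(countries)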
metmirr
  • @furas good to know that. – metmirr Jan 02 '17 at 14:26
  • `for line in f` keeps only one line in memory - it reads one line and executes the code inside `for`, then reads the next line and executes the code inside `for` again, and so on. `f.readlines()` first reads all lines into memory, and then `for` gets them line-by-line from memory. – furas Jan 02 '17 at 14:33
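
For anyone landing here later, here is a minimal sketch of another way to do it, assuming every data line in the file is a valid Python tuple literal like the sample in the question. Instead of splitting on commas, it parses each line with `ast.literal_eval`, so the quotes around the name never show up, and then removes a trailing "(...)" note. The name `country_names` is just a hypothetical helper for illustration, not something from the answers above.

```python
import ast
import re


def country_names(lines):
    """Return the country name (second field) from each tuple-formatted line."""
    names = []
    for line in lines:
        # Remove surrounding whitespace and a trailing comma, if any
        # (the Brazil line in the question's sample ends with one).
        line = line.strip().rstrip(',').strip()
        if not line.startswith('('):
            continue  # skip blank lines and anything that is not a tuple
        record = ast.literal_eval(line)  # "('1','Russia',...)" becomes a tuple
        # Remove a trailing note such as "(incl. overseas territories)".
        names.append(re.sub(r'\s*\(.*$', '', record[1]))
    return names


sample = [
    "('2','Canada','9,984,670(3,855,100)','North America','United States')\n",
    "('3','United States(incl. overseas territories)','9,857,348(3,805,943)','North America','Canada, Mexico')\n",
]
print(country_names(sample))  # ['Canada', 'United States']
```

To run it on the real file, pass the open file object, e.g. `with open('allcountries.txt') as f: print(country_names(f))`. Iterating over the file object reads one line at a time, as furas's comment describes.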