I'm trying to scrape Wikipedia for data on some famous people. I have no problem getting the data, but when I export it to CSV, a few entries always cause a major issue. The output CSV is formatted fine for most entries, but a few contain seemingly random line breaks that I can't get rid of. Here is sample data and code:
# 1. pull out wiki pages
sample_names_list = [{'name': 'Mikhail Fridman', 'index': 11.0},   # will work fine
                     {'name': 'Roman Abramovich', 'index': 12.0},  # will cause issue
                     {'name': 'Marit Rausing', 'index': 13.0}]     # has no wiki page, hence 'try' in loops below
# 1.1 get page title for each name in list
import wikipedia as wk
for person in sample_names_list:
    try:
        wiki_page = person['name']
        person['wiki_page'] = wk.page(title=wiki_page, auto_suggest=True)
    except Exception:
        pass  # no matching page (e.g. 'Marit Rausing'); leave 'wiki_page' unset
# 1.2 get page content for each page title in list
for person in sample_names_list:
    try:
        person_page = person['wiki_page']
        person['wiki_text'] = person_page.content
    except Exception:
        pass  # no 'wiki_page' key for this person, so there is no text to store
# 2. convert to dataframe
import pandas as pd
sample_names_data = pd.DataFrame(sample_names_list)
sample_names_data.drop('wiki_page', axis=1, inplace=True)  # drop unnecessary col
# 3. export csv
sample_names_data.to_csv('sample_names_data.csv')
Here is a screenshot of the output. As you can see, random line breaks are inserted in one of the entries and dispersed throughout with no apparent pattern.
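One possible explanation, offered as a guess and not a confirmed diagnosis: the page text may itself contain newline characters (between paragraphs and section headings), and a CSV writer keeps those inside a quoted field. The sketch below reproduces that situation without calling Wikipedia. The names `fake_wiki_text`, `demo`, and the "Person A"/"Person B" rows are made up for illustration and are not real page content.

```python
import io

import pandas as pd

# Hypothetical stand-in for an article body; real page text is assumed to
# contain "\n" between paragraphs and section headings.
fake_wiki_text = "Intro paragraph.\n\n== Early life ==\nBorn somewhere."

demo = pd.DataFrame(
    [
        {"name": "Person A", "index": 11.0, "wiki_text": "One line only."},
        {"name": "Person B", "index": 12.0, "wiki_text": fake_wiki_text},
    ]
)

# Write to an in-memory buffer instead of a file so nothing touches disk.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# The multi-line field is wrapped in quotes, so the output has more physical
# lines than records even though it is still valid CSV.
physical_lines = csv_text.count("\n")

# A CSV-aware reader puts the records back together.
round_trip = pd.read_csv(io.StringIO(csv_text))

print("physical lines:", physical_lines)
print("records read back:", len(round_trip))
```

If the real data behaves the same way, the file would be intact and the line breaks would only appear in viewers that show each physical line as a row.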
I've tried fiddling with the data types in sample_names_list, I've tried messing with to_csv's parameters, and I've tried other ways to export the CSV. None of these approaches worked. I'm new to Python, so there could well be a very obvious solution. Any help is much appreciated!
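For reference, here is a minimal sketch of one workaround that could be tried before building the dataframe, assuming the line breaks come from newline characters inside the page text. The helper `flatten_whitespace` and the `records` list are hypothetical; `records` only mimics the shape of sample_names_list after step 1.2.

```python
def flatten_whitespace(text):
    """Collapse every run of whitespace, including newlines, into one space."""
    # Non-string values (for example a missing text) are passed through as-is.
    if not isinstance(text, str):
        return text
    return " ".join(text.split())


# Hypothetical records shaped like sample_names_list after step 1.2.
records = [
    {"name": "Person A", "index": 11.0,
     "wiki_text": "Intro.\n\n== Career ==\nDetails."},
    {"name": "Person B", "index": 12.0},  # no page found, so no 'wiki_text' key
]

for record in records:
    if "wiki_text" in record:
        record["wiki_text"] = flatten_whitespace(record["wiki_text"])

print(records[0]["wiki_text"])
```

This trades the original paragraph structure for a single-line field, so it only makes sense if the paragraph breaks are not needed downstream.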