I'm trying to scrape Wikipedia for data on some famous people. I have no problem getting the data, but when I export it to CSV, a few entries always cause a major issue. The output CSV is formatted fine for most entries, but a few contain seemingly random line breaks that I can't get rid of. Here is sample data and code:

# 1. pull out wiki pages
sample_names_list = [{'name': 'Mikhail Fridman', 'index': 11.0}, #will work fine
                     {'name': 'Roman Abramovich', 'index': 12.0}, #will cause issue
                     {'name': 'Marit Rausing', 'index': 13.0}] #has no wiki page, hence 'try' in loops below

# 1.1 get page title for each name in list
import wikipedia as wk

for person in sample_names_list:
    try:
        wiki_page = person['name']
        person['wiki_page'] = wk.page(title = wiki_page, auto_suggest = True)
    except Exception:  # e.g. no matching wiki page; leave 'wiki_page' unset
        pass

# 1.2 get page content for each page title in list
for person in sample_names_list:
    try:
        person_page = person['wiki_page']
        person['wiki_text'] = person_page.content
    except Exception:  # e.g. 'wiki_page' is missing for this person
        pass

# 2. convert to dataframe
import pandas as pd
sample_names_data = pd.DataFrame(sample_names_list)
sample_names_data.drop('wiki_page', axis = 1, inplace= True) #drop unnecessary col

# 3. export csv
sample_names_data.to_csv('sample_names_data.csv')

Here is a screenshot of the output. As you can see, random line breaks are inserted in one of the entries, dispersed throughout with no apparent pattern:

[Screenshot: random line breaks are inserted in one of the entries with no apparent pattern]

I've tried changing the data types in sample_names_list, adjusting to_csv's parameters, and exporting the CSV in other ways. None of these approaches worked. I'm new to Python, so there could well be a very obvious solution. Any help is much appreciated!

mendy

1 Answer

The Wikipedia content has newlines in it, which are hard to represent reliably in a line-oriented format such as CSV.
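
To illustrate the point about newlines, here is a minimal sketch using made-up data (the `Example Person` row is hypothetical, not taken from the question). It assumes pandas' default CSV quoting. A single record with an embedded newline ends up spread over more than one physical line of the file, which is what trips up tools that read CSV line by line, even though a quote-aware parser can still reassemble it.

```python
import io

import pandas as pd

# Hypothetical stand-in for scraped page text; the embedded "\n" is the point.
df = pd.DataFrame({
    "name": ["Example Person"],
    "wiki_text": ["First paragraph.\nSecond paragraph."],
})

# With no path given, to_csv returns the CSV as a string.
csv_text = df.to_csv(index=False)

# One header line plus one record, yet the record spans two physical lines.
physical_lines = csv_text.splitlines()
print(len(physical_lines))

# A quote-aware parser still reads the record back as a single row.
round_trip = pd.read_csv(io.StringIO(csv_text))
print(len(round_trip))
```

Whether a given viewer or importer copes with this depends on whether it honours the quotes around the multi-line field.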

You can use Excel's Open dialog (not just double-clicking the file) and select "Text file" as the format, which lets you choose how to interpret various delimiters and quoted strings... but preferably just don't use CSV for data interchange at all.

  • If you need to work with Excel, use .to_excel() in Pandas.
  • If you need to just work with Pandas, use e.g. .to_pickle().
  • If you need interoperability with other software, .to_json() would be a decent choice.
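
As a rough sketch of two of the alternatives listed above, the following writes a small made-up frame with .to_json() and .to_pickle() and reads it back. The data, file names, and the choice of `orient="records"` are assumptions for illustration only. `.to_excel()` is left out because it needs an extra Excel writer package to be installed.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the scraped data, including an embedded newline.
df = pd.DataFrame({
    "name": ["Example Person"],
    "index": [11.0],
    "wiki_text": ["First paragraph.\nSecond paragraph."],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = os.path.join(tmp_dir, "sample_names_data.json")
    pickle_path = os.path.join(tmp_dir, "sample_names_data.pkl")

    # JSON escapes the newline inside the string, so nothing depends on
    # physical line structure.
    df.to_json(json_path, orient="records")
    from_json = pd.read_json(json_path, orient="records")

    # Pickle is only meant for reading back into Python/pandas.
    df.to_pickle(pickle_path)
    from_pickle = pd.read_pickle(pickle_path)

# The text survives the JSON round trip unchanged.
print(from_json.loc[0, "wiki_text"] == df.loc[0, "wiki_text"])
```

The JSON file is the one other software can read; the pickle file is tied to Python.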
AKX
  • I thought it was a newlines issue too, but the breaks aren't happening on newlines, and plenty of newlines are incorporated successfully in the rest of the output. – mendy Jan 11 '21 at 11:30
  • I want to export it for analysis in R (which I'm more familiar with), hence ideally want it as csv. – mendy Jan 11 '21 at 11:30
  • I would suggest JSON in that case, e.g. https://stackoverflow.com/questions/2617600/importing-data-from-a-json-file-into-r – AKX Jan 11 '21 at 11:34