
I have a csv of a bunch of news articles, and I'm hoping to use the newspaper3k package to extract the body text from those articles and save them as txt files. I want to create a script that iterates over every row in the csv, extracts the URL, extracts the text from the URL, and then saves that as a uniquely named txt file. Does anyone know how I might do this? I'm a journalist who is new to Python, sorry if this is straightforward.

I only have the code below. Before figuring out how to save each body text as a txt file, I figured I should try and just get the script to print the text from each row in the csv.

import newspaper as newspaper
from newspaper import Article
import sys as sys
import pandas as pd

data = pd.read_csv('/Users/alexfrandsen14/Desktop/Projects/newspaper3k-scraper/candidate_coverage.csv')

data.head()

for index,row in data.iterrows():
    article_name = Article(url=['link'], language='en')
    article_name.download()
    article_name.parse()
    print(article_name.text)
Alex F
  • Are all the url's in the same column? You seem to also be missing some python fundamentals here with your code.. you never call index or row in your script. Just because you tell python to enter a for loop doesn't mean it will do anything with the variables. – d_kennetz Feb 06 '19 at 22:40
  • Yes, all the url's are in the same column titled "link". And ah, thanks for the pointer. This is one of my first real forays into python so I'm not familiar with the fundamentals still, trying to learn as much intro stuff as I can though. – Alex F Feb 06 '19 at 22:47

1 Answer


Since all the urls are in the same column, it is easier to access that column directly with a for loop. I will go over some explanation below:

# to access your specific url column
from newspaper import Article
import pandas as pd

data = pd.read_csv('/Users/alexfrandsen14/Desktop/Projects/newspaper3k-scraper/candidate_coverage.csv')

for x in data['url_column_name']: # replace 'url_column_name' with the actual name in your df
    article = Article(x, language='en') # x is the url in each row of the column
    article.download()
    article.parse()
    f = open(article.title + '.txt', 'w') # open a file named after the title of the article (could be long)
    f.write(article.text)
    f.close()

I have not tried this package before, but from reading the posted tutorial this seems like it should work. Generally, you access the url column in your dataframe with the line for x in data['url_column_name']: and you replace 'url_column_name' with the actual name of the column (from your comment, it's "link").

Then, x will be the url in the first row so you will pass that to Article (you don't need brackets around x judging by the tutorial). It will download this first x and parse it, then open a file with the name of the title of the article, write the text to that file, then close that file.

It will then do this same thing for the second x, and third x, all the way until you run out of urls.
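One caveat: article titles can repeat, or contain characters like / that aren't valid in a filename. A minimal sketch of one way around that, assuming the same dataframe and the "link" column name from your comment, is to prefix each file with its row number and strip unsafe characters (the safe_filename helper below is my own, not part of newspaper3k):

```python
import re

def safe_filename(title, index, max_len=50):
    """Build a unique, filesystem-safe name like '0_Some Article Title.txt'."""
    # keep only letters, digits, spaces, and hyphens; drop everything else
    cleaned = re.sub(r'[^A-Za-z0-9 \-]', '', title).strip()
    # the row index guarantees uniqueness even if two titles match
    return f"{index}_{cleaned[:max_len]}.txt"

# usage inside the loop would look something like:
# for i, url in enumerate(data['link']):
#     article = Article(url, language='en')
#     article.download()
#     article.parse()
#     with open(safe_filename(article.title, i), 'w') as f:
#         f.write(article.text)
```

Using with open(...) also makes sure each file gets closed even if a later line raises an error.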

I hope this helps!

d_kennetz