I am trying to save my data to a CSV file using the code below. For some reason it isn't working correctly: only the values from the last loop iteration end up in the file.

import csv
import newspaper
import pandas as pd
from newspaper import Article
    
df = pd.DataFrame(data, columns=['txt','date1','authors1'])
lista = ['https://www.dawn.com/news/1643189','https://www.dawn.com/news/1648926/former-pakistan-captain-inzamamul-haq-suffers-heart-attack-in-lahore']
    
for list in lista:
    
    first_article = Article(url="%s" % list, language='de')
    first_article.download()
    first_article.parse()
    txt = first_article.text
    date1 = first_article.publish_date
    authors1 = first_article.authors
    data = [[txt,date1,authors1]]
    df = pd.DataFrame(data, columns=['txt','date1','authors1'])
    df.to_csv('pagedata.csv')
Life is complex
  • `df.to_csv('pagedata.csv')` overwrites the file on each iteration – ForceBru Sep 28 '21 at 21:36
  • Does this answer your question? [Import multiple csv files into pandas and concatenate into one DataFrame](https://stackoverflow.com/questions/20906474/import-multiple-csv-files-into-pandas-and-concatenate-into-one-dataframe) – psychemedia Sep 29 '21 at 08:47

1 Answer

This is expected behaviour: every iteration writes to the same filename, so you are overwriting your output each time. To get one file per article, try something like:

import pandas as pd
from newspaper import Article

lista = ['https://www.dawn.com/news/1643189','https://www.dawn.com/news/1648926/former-pakistan-captain-inzamamul-haq-suffers-heart-attack-in-lahore']

# enumerate() gives each URL an index, which is used to build a distinct filename
for i, url in enumerate(lista):
    first_article = Article(url=url, language='en')
    first_article.download()
    first_article.parse()
    txt = first_article.text
    date1 = first_article.publish_date
    authors1 = first_article.authors
    data = [[txt, date1, authors1]]
    df = pd.DataFrame(data, columns=['txt', 'date1', 'authors1'])
    # One file per iteration, so nothing is overwritten
    df.to_csv(f"page_{i}_data.csv")

Edit: apparently you want to collect all the data in a single file. Something like:

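# Create the (empty) DataFrame once, before the loop
df = pd.DataFrame(columns=['txt', 'date1', 'authors1'])
for row in lista:
    first_article = Article(url=row, language='en')
    first_article.download()
    first_article.parse()
    txt = first_article.text
    date1 = first_article.publish_date
    authors1 = first_article.authors
    # Add one row at the next free integer index
    df.loc[len(df.index)] = [txt, date1, authors1]

# Write once, after the loop, so every row ends up in the same file
df.to_csv("pagedata.csv")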

The main thing is having a variable outside the loop you can append to.

Note that I have corrected the language (these articles are in English, not German) and removed a redundant string format: `url="%s" % url` is equivalent to `url=url`.
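
As a further illustration of the "variable outside the loop" idea, here is a minimal sketch that collects the rows in a plain list and builds the DataFrame once at the end. `fetch_article` and the example.com URLs are hypothetical stand-ins for the newspaper download/parse step, used only so that the snippet runs offline; they are not part of the original code.

```python
import pandas as pd


# Hypothetical stand-in for Article(...).download()/.parse(); it returns
# made-up values so the sketch needs no network access.
def fetch_article(url):
    return {"txt": f"text of {url}", "date1": None, "authors1": ["unknown"]}


urls = ["https://example.com/a", "https://example.com/b"]  # placeholder URLs

rows = []  # defined outside the loop, so it keeps every iteration's result
for url in urls:
    rows.append(fetch_article(url))

# Build the DataFrame once, after the loop has finished.
df = pd.DataFrame(rows, columns=["txt", "date1", "authors1"])

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = df.to_csv(index=False)
```

In the real script, passing a filename such as "pagedata.csv" to `to_csv` would write the same rows to disk in one go.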

2e0byo
  • This gives me two CSV files, but I want all the data in one file as a proper DataFrame. For example, each iteration should add a new row to the same file, with the data stored column-wise under the column names (txt, date1, authors1): the text in the text column, the date in the date column, and so on. – khawar maqsood Sep 28 '21 at 22:02
  • Ah! You didn't exactly say that in the question... – 2e0byo Sep 28 '21 at 22:08
  • Yes, I think so. I hope you can help me out now, thanks. – khawar maqsood Sep 28 '21 at 22:16
  • The edited program gives an empty DataFrame. – khawar maqsood Sep 28 '21 at 22:29
  • @khawarmaqsood apologies, I don't use Pandas very much and naively thought I could append to it and it would do the right thing. I have now *run* the answer and it works. the principle is the same however: keep a var *outside* your loop and append to that. – 2e0byo Sep 29 '21 at 10:07
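
If the file really should grow by one row per iteration, as asked in the comments above, one possible approach is to open the file in append mode. This is only a sketch and has not been run against the original script. It uses the standard-library `csv` module, which the question already imports; `fetch_article`, the example.com URLs, and the temporary directory are assumptions made so that the snippet runs offline.

```python
import csv
import os
import tempfile


# Hypothetical stand-in for the newspaper download/parse step.
def fetch_article(url):
    return [f"text of {url}", "2021-09-28", "unknown"]


urls = ["https://example.com/a", "https://example.com/b"]  # placeholder URLs
path = os.path.join(tempfile.mkdtemp(), "pagedata.csv")

for url in urls:
    # Write the header only when the file does not exist yet.
    write_header = not os.path.exists(path)
    # Mode "a" appends to the file; the default "w" would truncate it.
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(["txt", "date1", "authors1"])
        writer.writerow(fetch_article(url))

# Read the file back to check that both rows were kept.
with open(path, newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))
```

Appending keeps earlier rows even if a later download fails, at the cost of reopening the file on every iteration; building everything in memory and writing once is usually simpler when the list of URLs is short.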