Stupid question. I have made my first scraper/crawler. It gives me exactly what I want, but when I write it to a CSV file, the text appears with `\n']` and brackets. If I try to remove them in any way, it breaks my output in the CSV file. Although the website is in Hebrew, that shouldn't be a problem. Just look at the CSV that you get. Thanks in advance.

import csv
import requests
from bs4 import BeautifulSoup as bs
import io

url='https://www.maariv.co.il/news/politics'
source = requests.get(url).text
soup = bs(source, 'html.parser')


file = io.open('maariv7.csv', 'w', encoding="utf-16")
csv_writer = csv.writer(file, delimiter='|')
csv_writer.writerow(['Headline', 'Summary', 'Text', 'name'])
file.close()  

def single_page_scraper(url):
    source = requests.get(url).text
    soup = bs(source, 'html.parser')
    
    file = io.open('maariv7.csv', 'a', encoding="utf-16")
    csv_writer = csv.writer(file, delimiter='|')
    
    for article in soup.find_all(class_='article-title'):
        headline = article.h1.text
        print (headline,'\n')
    
        for article in soup.find_all(class_='article-description'):
                summary = article.h2.text
                print(summary,'\n')
    
                text=[]
                name=[]
                for par in soup.find_all(class_='article-body'):            
                    text.append(par.get_text())
                    print(text)

                politics = io.open('politicians.txt', 'r', encoding="utf-8")
                my_list=politics.read().splitlines()
                my_file=str(text)            
                for i in my_list: 
                    if i in my_file:
                        name.append(i)

    name_list = ", ".join(name)         
    print(name_list,'\n''\n''\n''\n')           
    csv_writer.writerow([headline, summary, my_file, name_list])
    file.close()   
    
for articles in soup.find_all(class_='three-articles-in-row'):
    link = articles.a['href']  
    single_page_scraper(link)

  • I'm getting an error in this line: politics = io.open('politicians.txt', 'r', encoding="utf-8"), and this file doesn't exist. – Roy2012 Jul 08 '20 at 11:50
  • Without ability to run your program it's hard to see what's going on. Maybe `csv_writer.writerow([headline.strip(), summary.strip(), my_file.strip(), name_list.strip()])` will help? – Andrej Kesely Jul 08 '20 at 12:34
  • They are actually putting newlines in their text, so you should strip them right where you append the text: instead of `text.append(par.get_text())`, add the strip: `text.append(par.get_text().strip())`. – Gregor Jul 08 '20 at 14:05

2 Answers

Check out Yibo Yang's answer at the bottom.

Basically, try switching this line:

    file = io.open('maariv7.csv', 'a', encoding="utf-16")

to this (`newline=''` is an argument of `open`, not of `csv.writer`):

    file = io.open('maariv7.csv', 'a', encoding="utf-16", newline='')

And see if it makes a difference.
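As a rough illustration of where `newline=''` goes, here is a minimal sketch. The file name `demo.csv`, the helper names `write_rows` and `read_rows`, and the sample rows are all made up for this example; only the `|` delimiter and the UTF-16 encoding come from the question. Note that `newline=''` is a parameter of `open()`/`io.open()`; `csv.writer()` does not accept it.

```python
import csv
import io
import os
import tempfile


def write_rows(path, rows):
    # newline='' is passed to io.open(), so the csv module controls line endings itself
    with io.open(path, 'w', encoding='utf-16', newline='') as f:
        writer = csv.writer(f, delimiter='|')
        writer.writerows(rows)


def read_rows(path):
    # Reading with newline='' as well keeps newlines embedded in quoted fields intact
    with io.open(path, 'r', encoding='utf-16', newline='') as f:
        return list(csv.reader(f, delimiter='|'))


# Hypothetical sample data: the second row has a newline inside a field
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.csv')
rows = [['Headline', 'Text'], ['Some headline', 'first line\nsecond line']]
write_rows(demo_path, rows)
round_trip = read_rows(demo_path)
```

The writer quotes the field that contains a newline, so it stays in one CSV cell instead of splitting the row.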

mark_s

They are actually putting newlines in their text, so you should strip them right where you append the text: instead of `text.append(par.get_text())`, use `text.append(par.get_text().strip())`. So, inside of `single_page_scraper` I use:

for par in soup.find(class_='article-body'):
    if isinstance(par, NavigableString):
        t = par.strip()
    else:
        t = par.text.strip()
    if t != '':
        text.append(t)

Edit: you also need `from bs4 import NavigableString`.
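For what it's worth, the `\n']` and brackets in the CSV most likely come from `my_file=str(text)` in the question, which writes the list's printed representation rather than its contents. A minimal sketch of the difference, using made-up sample strings in place of the scraped paragraphs (the variable names here are hypothetical):

```python
# Made-up stand-in for the list built by text.append(par.get_text())
paragraphs = ['first paragraph\n', '\nsecond paragraph\n']

# str() on a list gives its repr: brackets, quotes and a literal backslash-n
as_repr = str(paragraphs)

# Stripping each piece and joining gives plain text for the CSV cell
cleaned = ' '.join(p.strip() for p in paragraphs if p.strip())

print(as_repr)
print(cleaned)
```

With the stripped list from the loop above, writing `' '.join(text)` to the CSV instead of `str(text)` should avoid the brackets.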

Gregor