
I have a spreadsheet (saved as a CSV file) with two columns: one with the URL ID and the other with the URL itself. The task is to extract the article from each URL and write it to a text file. The name of each text file should be the URL_ID from the first column.

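import bs4
import pandas as pd
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:55.0) Gecko/20100101 Firefox/55.0',
}
# Raw string so the backslashes in the Windows path are not treated as escapes
df = pd.read_csv(r'D:\Arhamdocs\Projects\question\Input.csv')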
data = df.URL
name = df.URL_ID
for url in data:
    page = requests.get(url,headers=headers).text
    soup = bs4.BeautifulSoup(page,"html.parser")
    article_title = soup.find('h1',class_='entry-title')
    article = soup.find('div',class_='td-post-content')
    # print(article)
    for i in name:
        file=open('%i.txt'%i,'w')
        for article_body in soup.find_all('p'):
            title = article_title.text
            body = article.text
            file.write(title)
            file.write(body)
        file.close()

I used the code above, but every file ends up containing the article from the last link. How can I fix this?

Arham

1 Answer

for url, i in zip(data, name):
    page = requests.get(url,headers=headers).text
    soup = bs4.BeautifulSoup(page,"html.parser")
    article_title = soup.find('h1',class_='entry-title')
    article = soup.find('div',class_='td-post-content')
    with open('%i.txt'%i,'w') as fp:
        for article_body in soup.find_all('p'):
            title = article_title.text
            body = article.text
            fp.write(title)
            fp.write(body)

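In your code, the inner loop over name ran for every URL, so each request rewrote all of the files and only the last article survived. Pairing each URL with its ID through zip avoids that. I kept the rest as in your code, but I think the title should be written outside the for loop: as it stands, the title and body are written once for every p tag on the page.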
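
To make the difference concrete, here is a minimal sketch that replaces the real requests with in-memory stand-ins. The IDs, the example.com URLs and the two helper functions are hypothetical and only illustrate the looping behaviour; a dictionary plays the role of the files on disk.

```python
# Hypothetical stand-ins for df.URL_ID and df.URL
url_ids = [101, 102, 103]
urls = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/c",
]


def nested_loop_result(urls, url_ids):
    """Mimic the nested loops: every ID is rewritten for every URL."""
    files = {}
    for url in urls:
        for url_id in url_ids:
            files[url_id] = url  # overwrites whatever was stored before
    return files


def zip_result(urls, url_ids):
    """Mimic the zip version: each ID is written once, with its own URL."""
    return {url_id: url for url, url_id in zip(urls, url_ids)}


nested = nested_loop_result(urls, url_ids)
zipped = zip_result(urls, url_ids)

print(nested)  # every ID maps to the last URL
print(zipped)  # each ID maps to the URL on the same row
```

With the nested loops, all three entries end up pointing at the last URL, which matches the symptom in the question. With zip, row 1 goes with ID 1, row 2 with ID 2, and so on.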

Chamesh
  • Thank you. It worked. Can you please tell me what the zip function does, and what other changes you made? – Arham Sep 18 '22 at 11:09
  • The zip function lets you iterate over several lists at the same time. For example, with `numbersList = [1, 2, 3]` and `str_list = ['one', 'two']`, `list(zip(numbersList, str_list))` returns `[(1, 'one'), (2, 'two')]`. zip pairs the items by position and stops as soon as the shortest list runs out. `with open(file) as fp:` makes file handling easier because the file is closed automatically when the block ends, so you don't need to call close() yourself. Check this [link](https://stackoverflow.com/questions/19508703/how-to-open-a-file-through-python) for more info. – Chamesh Sep 18 '22 at 14:21
  • Thank you. I continued running the program. It worked for some of the URLs in the loop, and then I got this error: 'charmap' codec can't encode character '\u2248' in position 3847: character maps to <undefined>. This happens only for some specific links. Any idea why? – Arham Sep 18 '22 at 18:19
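
A likely cause of the 'charmap' error, assuming the script runs on Windows: open() in text mode uses the locale's default encoding when none is given (often cp1252), and that encoding has no mapping for '\u2248' (the "almost equal to" sign). Passing encoding="utf-8" to open() is the usual way around it. The sketch below reproduces the failure with an explicit cp1252 encode and then writes the same text as UTF-8; the sample text and the file name are made up for illustration.

```python
import os
import tempfile

# Hypothetical article text containing the character from the error message
text = "pi \u2248 3.14"

# cp1252 has no mapping for U+2248, so encoding with it fails
try:
    text.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# Writing with an explicit UTF-8 encoding works for any Unicode character
out_path = os.path.join(tempfile.mkdtemp(), "37.txt")  # hypothetical URL_ID
with open(out_path, "w", encoding="utf-8") as fp:
    fp.write(text)

# Read the file back with the same encoding to confirm nothing was lost
with open(out_path, encoding="utf-8") as fp:
    round_trip = fp.read()

print(cp1252_ok)
print(round_trip == text)
```

Applied to the answer's code, that would mean opening each output file with `open('%i.txt' % i, 'w', encoding='utf-8')`.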