
I'm very new to Python, and I'm trying to write a program that extracts the text inside HTML tags (without the tags themselves) and writes it to a separate text file for later analysis. I referred to this and this as well, and was able to come up with the code below. But how can I write this as a separate function? Something like

def read_list('file1.txt')

that then does the same scraping? The reason I'm asking is that the output of this code (op1.txt) will be used for stemming and then for other calculations afterwards. The output also isn't written line by line as intended. Thank you very much for any input!

f = open('file1.txt', 'r')
for line in f:
    url = line
    html = urlopen(url)
    bs = BeautifulSoup(html, "html.parser")
    content = bs.find_all(['title','h1', 'h2','h3','h4','h5','h6','p'])

    with open('op1.txt', 'w', encoding='utf-8') as file:
        file.write(f'{content}\n\n')
        file.close()
blackgreen
  • file1.txt is the file that contains the list of URLs. Once the scraping is done, the result should be written to a separate file (op1.txt) – user13178113 Nov 26 '20 at 14:02

1 Answer


Try it like this:

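from urllib.request import urlopen
from bs4 import BeautifulSoup

def read_list(fl):
    # Tags whose text is kept: title, paragraphs, and headings h1..h6
    tags = ['title', 'p'] + [f'h{n}' for n in range(1, 7)]
    # Open the output once, so each URL's text is written instead of
    # being overwritten by the next one
    with open(fl, 'r') as f, open('op1.txt', 'w', encoding='utf-8') as file:
        for line in f:
            url = line.strip()
            if not url:  # skip blank lines in the URL list
                continue
            html = urlopen(url).read().decode("utf8")
            bs = BeautifulSoup(html, "html.parser")
            content = '\n'.join(x.text for x in bs.find_all(tags))
            file.write(f'{content}\n\n')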
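
As for what this does: each line of the input file is treated as a URL, the page is downloaded, and the text of every title, heading, and paragraph tag is joined with newlines. If later steps (stemming, comparisons) need the same text, one option is to separate the parsing from the downloading. This is only a sketch under that assumption; `extract_text` and `TAGS` are hypothetical names, not part of BeautifulSoup:

```python
from bs4 import BeautifulSoup

# Tags whose text we keep: title, paragraphs, and headings h1..h6.
TAGS = ['title', 'p'] + [f'h{n}' for n in range(1, 7)]


def extract_text(html):
    """Return the text of the selected tags, one tag per line (hypothetical helper)."""
    bs = BeautifulSoup(html, "html.parser")
    return '\n'.join(x.text for x in bs.find_all(TAGS))


# A small inline page, so the function can be tried without any network access.
sample = "<html><title>T</title><h1>Heading</h1><p>Body</p></html>"
print(extract_text(sample))
```

Because the helper takes a string of HTML rather than a URL, it can be checked on small inline snippets before it is pointed at real pages.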
Wasif
  • Thank you, this works like a charm. But can you please explain what you have done here? If I want to use the output of this code in the next step of the program, where I have to compare it with another file and remove the duplicates, should I do that in a different function, just below this one? I'm very new to Python and still learning from the beginning. Any feedback is appreciated! – user13178113 Nov 26 '20 at 14:12
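
On the follow-up about removing duplicates: keeping that step in its own function is a reasonable approach, since it can then be run on op1.txt without scraping again. A minimal sketch, assuming both files hold one entry per line and that "duplicate" means an exact match after stripping whitespace. The names `remove_duplicates` and `remove_duplicates_from_files` and the path parameters are hypothetical, not from the original post:

```python
def remove_duplicates(lines, other_lines):
    """Return the stripped lines that are not in other_lines.

    Repeats within `lines` are also dropped; the original order is kept.
    """
    seen = {line.strip() for line in other_lines}
    result = []
    for line in lines:
        key = line.strip()
        if key and key not in seen:  # skip blanks and anything already seen
            seen.add(key)
            result.append(key)
    return result


def remove_duplicates_from_files(scraped_path, other_path, out_path):
    """File-based wrapper around remove_duplicates (paths are assumptions)."""
    with open(scraped_path, encoding='utf-8') as f:
        scraped = f.readlines()
    with open(other_path, encoding='utf-8') as f:
        other = f.readlines()
    kept = remove_duplicates(scraped, other)
    with open(out_path, 'w', encoding='utf-8') as f:
        f.write('\n'.join(kept) + '\n')
    return kept
```

The comparison logic lives in a function that works on plain lists, so it can be tried on a few hand-written lines first; the wrapper only handles reading and writing the files.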