
I have a bunch of HTML files whose markup I would like to fix with the help of bs4. But once I run the code, all the files are just empty (luckily I made a backup before running my script on the folder).

This is what I have so far:

from bs4 import BeautifulSoup
import os
for entry in os.scandir(path):
    if entry.is_file() and entry.path.endswith('html'):
        file = open(entry.path, 'w+')
        soup = BeautifulSoup(file, 'html.parser')
        file.write(soup.prettify())
        print(colored('Success', 'green'))
        file.close()

The expected result is that each file is read, prettified and saved.

petezurich
Adam

3 Answers


You have truncated the files by opening them in 'w+' mode, which empties a file as soon as it is opened. Take a look at this answer here, which explains in detail which mode you require.

More information can be found in the Python docs, here for 2.7 and here for Python 3.
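To make the difference between the modes concrete, here is a minimal sketch that creates a throwaway file, reopens it in a given mode without reading or writing anything, and reports its size afterwards. The helper name `size_after_open` and the sample content are made up for illustration; only standard-library calls are used.

```python
import os
import tempfile


def size_after_open(mode: str) -> int:
    """Create a small file, reopen it with `mode`, and return its size on disk."""
    fd, tmp_path = tempfile.mkstemp(suffix='.html')
    try:
        with os.fdopen(fd, 'w') as f:
            f.write('<p>hello</p>')  # 12 bytes of sample content
        # Reopen and close again without reading or writing anything.
        with open(tmp_path, mode):
            pass
        return os.path.getsize(tmp_path)
    finally:
        os.remove(tmp_path)


print(size_after_open('r'))   # 12: opening for reading leaves the content alone
print(size_after_open('r+'))  # 12: read/write without truncation
print(size_after_open('w+'))  # 0: the file is emptied the moment it is opened
```

So with 'w+' there is nothing left for BeautifulSoup to parse by the time the file object is handed to it.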

Saif Asif
  • Alright, but I want to truncate the file and then save it with new content (prettified) – Adam Jan 15 '20 at 17:03
  • Then you need to read the file contents first, opening the file in `r` (read-only) mode, and then open a new file in `w` mode to dump the new contents – Saif Asif Jan 15 '20 at 17:04

By opening the file with 'w+', you delete what's in it before you can read it. Solution:

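from bs4 import BeautifulSoup
import os
from termcolor import colored  # assumption: colored() comes from termcolor; the import is not shown in the question

# `path` is the folder containing the HTML files, defined elsewhere as in the question.
for entry in os.scandir(path):
    if entry.is_file() and entry.path.endswith('html'):
        # Parse the file while it is open for reading only.
        with open(entry.path, 'r') as readFile:
            soup = BeautifulSoup(readFile, 'html.parser')
        # Only now reopen it for writing, which truncates the old content.
        with open(entry.path, 'w') as writeFile:
            writeFile.write(soup.prettify())
        print(colored('Success', 'green'))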



You've used the 'w+' mode to open the file. This truncates (clears) all file content as soon as the file is opened, so there is nothing left to parse.

Use 'r' to read the file contents, then process them, and only then use 'w+' to overwrite the file with the processed contents.

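from bs4 import BeautifulSoup
import os
from termcolor import colored  # assumption: colored() comes from termcolor; the import is not shown in the question

# `path` is the folder containing the HTML files, defined elsewhere as in the question.
for entry in os.scandir(path):
    if entry.is_file() and entry.path.endswith('html'):
        # Read the whole file first; the handle is closed when the block exits.
        with open(entry.path, 'r') as f:
            contents = f.read()
        soup = BeautifulSoup(contents, 'html.parser')
        # Reopening with 'w+' truncates the file, then the new markup is written.
        with open(entry.path, 'w+') as f:
            f.write(soup.prettify())
        print(colored('Success', 'green'))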
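Since the original run wiped the files, a more defensive variant may be worth considering. The following is only a sketch, not part of bs4: it writes the new content to a temporary file in the same folder and swaps it in with `os.replace`, so the original is left alone if reading, processing or writing fails. The helper name `rewrite_file`, its `transform` parameter and the UTF-8 encoding are assumptions made for illustration.

```python
import os
import tempfile
from typing import Callable


def rewrite_file(file_path: str, transform: Callable[[str], str]) -> None:
    """Read file_path, apply transform to its text, then replace the file with the result."""
    # Assumption: the files are UTF-8 encoded; adjust if yours are not.
    with open(file_path, 'r', encoding='utf-8') as f:
        original = f.read()

    # If transform raises, nothing has been written and the file is untouched.
    new_text = transform(original)

    # Create the temporary file next to the target so os.replace stays on one filesystem.
    directory = os.path.dirname(os.path.abspath(file_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
    try:
        with os.fdopen(fd, 'w', encoding='utf-8') as f:
            f.write(new_text)
        # Swap the finished temporary file into place.
        os.replace(tmp_path, file_path)
    except BaseException:
        # Writing or replacing failed: drop the temporary file, keep the original.
        os.remove(tmp_path)
        raise
```

Inside the loop above, this could be called as `rewrite_file(entry.path, lambda html: BeautifulSoup(html, 'html.parser').prettify())`. One trade-off to be aware of: the replaced file takes on the permissions of the temporary file that `tempfile.mkstemp` created, which may be more restrictive than the original's.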

For more information about file-opening modes in Python, see these resources:

Excellent StackOverflow answers

Manpagez

Python documentation

Suyash