My function reads multiple .sgm files. I get an error when reading the content of a file, specifically at the line `contents = f.read()`.

import os

from bs4 import BeautifulSoup


def block_reader(path):
    # Collect the full path of every .sgm file in the directory.
    file_paths = []
    for filename in os.listdir(path):
        if filename.endswith(".sgm"):
            file_paths.append(os.path.join(path, filename))

    for file in file_paths:
        with open(file, 'r') as f:
            print(f)
            contents = f.read()  # the UnicodeDecodeError is raised here
            soup = BeautifulSoup(contents, "lxml")

    return ["test content"]

Error message

Traceback (most recent call last):
  File "./block-1-reader.py", line 32, in <module>
    for reuters_file_content in solutions.block_reader(path):
  File "/home/ragith/Documents/A-School/Fall-2020/COMP_479/Assignment_1/solutions.py", line 29, in block_reader
    contents = f.read()
  File "/usr/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 1519554: invalid start byte
Daniel Walker
PolarisRouge
    Try this: `with open(path, 'rb') as f:`. The `b` in the mode specifier tells `open()` to treat the file as binary, so `contents` will remain a `bytes` object. No decoding is attempted this way. More details at: [link](https://stackoverflow.com/a/42340744/8473925) – Hamza Rana Sep 24 '20 at 23:34
    @HamzaRana Thanks! It worked. If you write it as a solution I will accept it. – PolarisRouge Sep 24 '20 at 23:54

1 Answer

Try this: `with open(path, 'rb') as f:`. The `b` in the mode specifier tells `open()` to treat the file as binary, so `contents` will remain a `bytes` object. No decoding is attempted this way. More details at: [this link](https://stackoverflow.com/a/42340744/8473925)
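
Here is a minimal sketch of the difference, using only the standard library. The file name `sample.sgm` and its contents are made up for illustration; the only thing taken from the question is the offending byte `0xfc`. The last step, decoding as Latin-1, rests on an assumption about the data: Latin-1 maps every byte to a character, so it never raises, but it only produces the right text if the files really are in that encoding.

```python
import os
import tempfile

# Hypothetical sample: ASCII markup containing the single byte 0xfc,
# which is not a valid start byte in UTF-8.
sample = b"<BODY>Z\xfcrich</BODY>"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.sgm")
    with open(path, "wb") as f:
        f.write(sample)

    # Text mode decodes while reading, so the bad byte raises an error.
    try:
        with open(path, "r", encoding="utf-8") as f:
            f.read()
        text_mode_failed = False
    except UnicodeDecodeError:
        text_mode_failed = True

    # Binary mode skips decoding and returns the raw bytes unchanged.
    with open(path, "rb") as f:
        contents = f.read()

    # Assumption: if a str is needed, Latin-1 can decode any byte value,
    # but the result is only correct if the file is actually Latin-1.
    with open(path, "r", encoding="latin-1") as f:
        decoded = f.read()

print(text_mode_failed, type(contents).__name__)
```

With `'rb'`, `contents` can be passed straight to `BeautifulSoup(contents, "lxml")`, since BeautifulSoup accepts `bytes` and tries to detect the encoding itself.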

Hamza Rana