
I'm trying to read an SRT file that is in Hebrew. The encoding is supposed to be cp1255, but the file cannot be read with that encoding. I can read it as UTF-8, but then the text does not follow the rules of the Hebrew language. I want to save the file in cp1255 after reading it with the 'pysubs2' library in Python. Is there any way to do this?

van neilsen
  • Related: https://stackoverflow.com/questions/436220/determine-the-encoding-of-text-in-python – bobrobbob Jun 30 '18 at 11:25
  • If it's valid UTF-8 then the problem is probably elsewhere. Can you please [edit] your question to include a (smallish) snippet, ideally with a hex dump of the raw bytes and your best guess as to what actual text it's supposed to represent? See also the [Stack Overflow `character-encoding` tag info](/tags/character-encoding/info) for background and troubleshooting pointers. – tripleee Jun 30 '18 at 20:20

1 Answer


Old question, but figured I'd post in case anyone else is trying to do this. I've done something similar, shown below.

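import chardet

# Placeholder path (an assumption); point this at your own subtitle file
subtitle_input_path = 'subtitles.srt'

# Sniff out encoding method from the first 10 lines of raw bytes
with open(subtitle_input_path, 'rb') as f:
  rawdata = b''.join([f.readline() for _ in range(10)])

# Encoding method (None if detection fails) and method whitelist.
# chardet reports names like 'utf-8' and 'ascii', so compare in lowercase.
encoding_method = chardet.detect(rawdata)['encoding']
encoding_method_whitelist = ['utf-8', 'ascii']

# If encoding method will cause issues, convert it to utf-8
if encoding_method and encoding_method.lower() not in encoding_method_whitelist: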

  # Read the old file's content
  with open(subtitle_input_path, encoding=encoding_method) as subtitle_file:
    subtitle_text = subtitle_file.read()

  # Convert to utf-8 and write to file
  with open(subtitle_input_path,'w', encoding='utf8') as subtitle_file:
    subtitle_file.write(subtitle_text)
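
The snippet above normalizes everything to UTF-8. The question asks for the opposite direction (ending up with cp1255), so here is a minimal sketch of that step using only the standard library. The function name `convert_encoding` and its parameters are made up for illustration. It assumes the source file really is valid UTF-8 and reads it as `utf-8-sig` so that a leading BOM, if present, is dropped (cp1255 has no mapping for the BOM character).

```python
def convert_encoding(src_path, dst_path,
                     src_encoding="utf-8-sig", dst_encoding="cp1255"):
    """Re-encode a text file (hypothetical helper, not part of pysubs2).

    Raises UnicodeEncodeError if the text contains a character that
    has no cp1255 mapping, rather than silently replacing it.
    """
    # newline="" keeps the original line endings untouched
    with open(src_path, encoding=src_encoding, newline="") as src:
        text = src.read()

    with open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        dst.write(text)

    return text
```

Calling `convert_encoding("in.srt", "out.srt")` (both file names are placeholders) writes a cp1255 copy and leaves the original alone. If you are going through pysubs2 instead, my understanding is that both its load and save calls accept an `encoding` argument, so the same read-as-UTF-8, write-as-cp1255 idea should apply there, but check the pysubs2 docs for the exact signatures.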
iPzard