
My task is to have Python read a text file as one very long string. That is to say, it isn't a csv or tsv; there is no tabular structure to the file at all, it's just a slew of words. However, the file contains commas, quotes, and things of that nature, so I'm getting parsing issues.

I have tried:

import string

with open('text_file.txt') as f:
    text_data = f.read().translate(string.punctuation)

This resulted in an error that read: 'charmap' codec can't decode byte 0x9d in position 47: character maps to <undefined>

I'm not sure if that error was the result of punctuation within the .txt file interfering with the parsing process, or if there were some strange non-Unicode characters that cannot be read. Potentially, I may need a solution that is robust to both of these problems.

If you feel that there are better ways than my simultaneous read/strip punctuation approach to achieve my goal, feel free to suggest alternatives.

Arash Howaida
  • This might be related to printing unicode characters to your terminal: http://stackoverflow.com/questions/14284269/why-doesnt-python-recognize-my-utf-8-encoded-source-file – Trey Hunner Dec 23 '16 at 18:52
  • Are there any means or arguments I can pass along the parsing process to just drop a character it doesn't recognize? That way I don't have to have a specific char set for whatever task I happen to have at hand. I'm thinking the `with` handler doesn't actually place any text it reads in the terminal, but I could be wrong. – Arash Howaida Dec 23 '16 at 18:59
  • Do you know the encoding of the text file? If it's utf-8, it may be as easy as `open('text_file.txt', encoding='utf-8')`. – tdelaney Dec 23 '16 at 19:19
  • I know utf-8 is fairly standard these days, but I have many, many text files to parse. Is there an approach that is robust to different encodings, on the off chance that one of these .txt files is actually something else? Nonetheless, your comment solved my issue; if you could post an answer I could give you credit for it. – Arash Howaida Dec 23 '16 at 19:25
  • Unfortunately there is no reliable standard. Some interesting reading is at https://docs.python.org/3.3/howto/unicode.html (search for "BOM") and http://unicodebook.readthedocs.io/guess_encoding.html, which is C based but could be rewritten in Python. The "Big 3" are UTF-8 (most unixy systems), Windows BOM (0xfffe or 0xfeff at the start of the file) and Windows codepage (which defines chars 0x80-0xff as "whatever your local code page happens to be" - that is, with no hint as to which language you've got). – tdelaney Dec 23 '16 at 19:59
  • Ok thank you for the info, I'm sure I can find a work around. – Arash Howaida Dec 23 '16 at 20:11
  • The way you use str.translate isn't going to remove punctuation like you want (to test: `"foo, bar. baz".translate(string.punctuation)` returns the original string), and does it make sense to strip punctuation and risk having words merge? Perhaps `re.findall(r'\w+', f.read())` would work. – tdelaney Dec 23 '16 at 20:21

1 Answer


It looks like your files are in an encoding other than the one Python expects by default, but there is no single standard way to detect encodings, so some guesswork is required. There are various modules and tools out there to help; I've used a module called chardet to do the work for me.

You also have a problem with how you use str.translate. It needs a translation table (typically built with str.maketrans); as written, your call won't remove the punctuation. You may be better off using a regular expression to find the words and rebuilding a string from there.
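
For reference, the translate route would look something like the sketch below (assuming the file's encoding is already known; the regex approach further down sidesteps translate entirely):

import string

# Build a table that maps every punctuation character to None, then apply it
with open('text_file.txt', encoding='utf-8') as f:
    text_data = f.read().translate(str.maketrans('', '', string.punctuation))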

from chardet.universaldetector import UniversalDetector
import re

# Feed the raw bytes to chardet until it is confident about the encoding
detector = UniversalDetector()
with open('text_file.txt', 'rb') as f:
    for line in f:
        detector.feed(line)
        if detector.done:
            break
detector.close()

# Reopen with the detected encoding and rebuild the text from the words
with open('text_file.txt', encoding=detector.result['encoding']) as f:
    text = ' '.join(re.findall(r'\w+', f.read()))
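
If pulling in chardet feels like overkill, or a file still defeats detection, the errors argument to open() is a blunter fallback: errors='replace' substitutes a placeholder character for undecodable bytes and errors='ignore' drops them outright. A rough sketch, assuming utf-8 as the default guess:

import re

# Decode what we can; anything undecodable becomes U+FFFD instead of raising an error
with open('text_file.txt', encoding='utf-8', errors='replace') as f:
    text = ' '.join(re.findall(r'\w+', f.read()))
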
tdelaney