1

My code

lines=[]
with open('biznism.txt') as outfile:
    for line in outfile:
        line = line.strip()
        lines.append(line)

This is what I have in my Jupyter notebook

["\ufeffIf we are all here, let's get started. First of all, I'd like you to please join me in welcoming Jack Peterson, our Southwest Area Sales Vice President.",
 "Thank you for having me, I'm looking forward to today's meeting.",
 "I'd also like to introduce Margaret Simmons who recently joined our team.",
 'May I also introduce my assistant, Bob Hamp.',
 "Welcome Bob. I'm afraid our national sales director, Anne Trusting, can't be with us today. She is in Kobe at the moment, developing our Far East sales force.",

I will use the file content for text analytics, and this \ufeff will make a mess of it. How do I get rid of it?

Richard Rublev
  • 1
    U+FEFF is ZERO WIDTH NO-BREAK SPACE (decimal 65279, UTF-8 bytes 0xEF 0xBB 0xBF, block: Arabic Presentation Forms-B), and it has no visual representation. Seeing it in your output means the character is actually in the file. You can either delete it manually or use a regex to drop non-printable characters. – Michal Polovka Nov 07 '18 at 09:59
  • Thanks, the encoding helps, but I should still take care of non-printable chars. – Richard Rublev Nov 07 '18 at 10:01
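
For the non-printable characters mentioned in the comments, here is a minimal sketch of one possible cleanup. It assumes that dropping every character in Unicode category "Cf" (format characters, which includes U+FEFF) is acceptable for the analysis. The helper name `drop_format_chars` is made up for illustration.

```python
import unicodedata

def drop_format_chars(text):
    """Remove characters in Unicode category 'Cf' (format), which includes U+FEFF."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

# The first line from the question, with the stray U+FEFF at the front.
cleaned = drop_format_chars("\ufeffIf we are all here, let's get started.")
```

Note that category "Cf" also covers soft hyphens and zero-width joiners, so this may be too aggressive for text in scripts that rely on them.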

1 Answer

5

You should open the file with the correct encoding. If it is UTF-8 with a byte order mark (BOM), 'utf-8-sig' strips the mark on read, for example:

with open('biznism.txt', encoding='utf-8-sig') as outfile:

or, if the file is actually encoded as UTF-16:

with open('biznism.txt', encoding='utf-16') as outfile:
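
As a rough, self-contained illustration of the difference, the sketch below writes a small temporary file that starts with a UTF-8 BOM and reads it back twice. The file contents and the helper name `read_lines` are made up for the demo. It assumes the real file is UTF-8 with a BOM, which the \ufeff in the question's output suggests.

```python
import os
import tempfile

def read_lines(path, encoding="utf-8-sig"):
    """Return the stripped lines of a text file; 'utf-8-sig' drops a leading BOM."""
    with open(path, encoding=encoding) as infile:
        return [line.strip() for line in infile]

# Demo file: the bytes EF BB BF are the UTF-8 encoding of U+FEFF.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"\xef\xbb\xbfIf we are all here, let's get started.\nWelcome Bob.\n")
    demo_path = tmp.name

with_bom = read_lines(demo_path, encoding="utf-8")  # BOM survives as '\ufeff'
without_bom = read_lines(demo_path)                 # BOM removed by 'utf-8-sig'
os.remove(demo_path)
```

str.strip() does not remove U+FEFF because Python does not treat it as whitespace, which is why it survives in the question's output. 'utf-8-sig' also reads files without a BOM correctly, so it is a reasonable default when the source of the file is unknown.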
Andreas