2

I am trying to read an Excel file in Python without using pandas or xlrd, and I have been trying to convert the results from bytes to UTF-8 without any success.

data from xls file

colA    colB    colC
spc     1D0     20190705
spd     1D0     20190705
spe     1D0     20190705
... (goes on for 500k lines)

code

with open(file, 'rb') as f:
    data = f.readlines(1)  # Just to check the first line that is printed out
    print(data[0].decode('utf-8'))

The error I receive is UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 0: invalid continuation byte

If I were to print data without decoding it, the result is: [b'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00>\x00\x03\x00\xfe\xff\t\x00\x06\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x9e\x00\x00\x00\x9dN\x00\x00\x00\x00\x00\x00\x00\x10\x00\x00\xfe\xff\xff\xff\x00\x00\x00\x00\xfeM\x00\x00\x01\x00\x00\x00\xffM\x00\x00\x00N\x00\x00\x01N\x00\x00\x02N\x00\x00\x03N\x00\x00\x04N\x00\x00\x05N\x00\x00\x06N\x00\x00\x07N\x00\x00\x08N\x00\x00\tN\x00\x00\n']

I have no particular reason to avoid pandas or xlrd; I am just trying to parse the data using only the standard library, if possible.

Any thoughts?

jake wong
  • The error tells you there is a specific character in the Excel file that cannot be decoded with 'utf-8'. Try using a different encoding, but it's still not known what sort of characters may be lurking around in the doc. Perhaps you _should_ give pandas a try: `pd.read_excel(file)` and see what you get. – amanb Jul 08 '19 at 08:10
  • 3
    Excel is a binary format, not plain-text. If you don't want to use `xlrd` or `pd.read_excel`, you'll have to *reimplement* what those libraries do. – lenz Jul 08 '19 at 08:11
  • 1
    Even if you want to parse .xlsx files, which are considerably easier than .xls, you still have quite a bit of work to do. I guess you are doing this as a learning exercise? If so, then I think you should take a look at [this question](https://stackoverflow.com/questions/4886027/looking-for-a-clear-description-of-excels-xlsx-xml-format) to find out where to read about the .xlsx specifications. If you are truly trying to learn about .xls files, I urge you to reconsider. There are plenty of other things you could be learning about that are more useful and less painful. – John Y Jul 12 '19 at 21:30

2 Answers

2

You need to unzip the xlsx file first, before you can read its contents (assuming that is the format you are using).
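A minimal sketch of what unzipping and reading could look like with only the standard library, assuming the file really is .xlsx. Note that the byte dump in the question starts with `D0 CF 11 E0 A1 B1 1A E1`, which is the signature of the OLE2 compound file container used by legacy .xls files, so it is worth checking the format first. The helper names `sniff_format` and `read_xlsx_rows` are hypothetical, and the default sheet path `xl/worksheets/sheet1.xml` is an assumption (it is the usual name of the first sheet, but the workbook may use another).

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by the SpreadsheetML parts inside an .xlsx archive
NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}
T_TAG = "{%s}t" % NS["m"]
ROW_TAG = "{%s}row" % NS["m"]
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # legacy .xls container


def sniff_format(path):
    """Guess the container format from the file's leading bytes."""
    with open(path, "rb") as f:
        head = f.read(8)
    if head == OLE2_MAGIC:
        return "xls"
    if zipfile.is_zipfile(path):
        return "xlsx"
    return "unknown"


def read_xlsx_rows(path, sheet="xl/worksheets/sheet1.xml"):
    """Yield each row of one worksheet as a list of cell strings.

    Empty cells are not stored in the file, so they are simply absent
    from the yielded lists.
    """
    with zipfile.ZipFile(path) as z:
        # Most text cells hold an index into a shared string table
        shared = []
        if "xl/sharedStrings.xml" in z.namelist():
            root = ET.fromstring(z.read("xl/sharedStrings.xml"))
            for si in root.findall("m:si", NS):
                shared.append("".join(t.text or "" for t in si.iter(T_TAG)))

        sheet_root = ET.fromstring(z.read(sheet))
        for row in sheet_root.iter(ROW_TAG):
            values = []
            for c in row.findall("m:c", NS):
                v = c.find("m:v", NS)
                text = v.text if v is not None and v.text else ""
                if c.get("t") == "s" and text:
                    text = shared[int(text)]  # shared string lookup
                elif c.get("t") == "inlineStr":
                    text = "".join(t.text or "" for t in c.iter(T_TAG))
                values.append(text)
            yield values
```

Calling `sniff_format(file)` before `read_xlsx_rows(file)` avoids a confusing `BadZipFile` error on an .xls input. This sketch parses the whole sheet into memory; for a sheet with 500k rows, `ET.iterparse` on `z.open(sheet)` would be the gentler choice. Numbers and dates come back as the raw strings stored in the XML.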

pygri
  • 1
    Ideally, you should show some code for how to do this (e.g. using the std-lib `zipfile` module) and then how to proceed once the xlsx archive is unpacked (which file to process, how to access the data of a cell, etc.) – lenz Jul 08 '19 at 08:14
  • It would probably be wise to wait for confirmation that xlsx is indeed the format the OP is trying to read before embarking on such an enterprise... – pygri Jul 08 '19 at 08:20
  • See also [this comment in another thread](https://stackoverflow.com/questions/35744613/read-in-xlsx-with-csv-module-in-python/59973648#59973648), presenting a solution to reading an `*.xlsx` Excel file using just standard library functionality. – Eiríkr Útlendi Apr 07 '20 at 05:32
-4

Try this

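import csv

# Raw string so the backslash in the Windows path is not read as an escape
with open(r'D:\dew.csv', 'rt', newline='') as f:
    data = csv.reader(f)
    # This will print every row one by one
    for r in data:
        print(r)
# No f.close() needed: the with block closes the file on exit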
Baumflaum
whatever
  • 2
    From the description the OP has given (though they have not been specific), this does not appear to be answering the question posed. Your solution is for a text based file, the OP appears to be struggling with an (assumed) .xls or .xlsx file. – George Crowther Dec 15 '21 at 12:12