0

I am trying to organize a very large number of .DTA files using the xlrd library.

The first thing I found out was that .DTA files could be exported to excel files just by changing the extension .xls and opening them in excel. It gives a warning when you open it gives an error about a possibly corrupted file, but opens normally otherwise.

the file you are trying to open is in a different format than specified by the file extension. Verify that the file is not corrupted and is from a trusted source before opening the file. Do you want to open the file now?

When in python however, when I try to open the file all I get is an error with no helpful information, which I'm pretty sure is caused by the file extension issue.

File "C:\Python27\lib\site-packages\xlrd\__init__.py", line 1323, in getbof raise XLRDError('Expected BOF record; found 0x%04x' % opcode) XLRDError: Expected BOF record; found 0x5845

I tried my code by cutting and pasting the data into a new excel file and naming it the same thing and it worked, so I'm pretty sure this is the issue, but I have too many files to be able to do this for each one individually.

Is there a better way to solve this? Supressing the error or actually changing the file type and not just its extension somehow?

Ahmed
  • 184
  • 1
  • 2
  • 14
  • Can you open the file with a text editor, delete the everything until the visible first character save the file with a different name and then reopen it with Excel and then if you get no warning open with Python. We had this problem, if I remember the name of the problem I will write an answer but really just open using a read and take all but the first character mystring = open(file).read(), mystring = mystring[1:] – PyNEwbie Feb 10 '14 at 23:57
  • I have hundreds of files to sort, manually changing things is not an option. – Ahmed Feb 11 '14 at 00:11
  • Just do one to confirm whether or not this is the case if it is changing the files will be easy - carefully read my comment – PyNEwbie Feb 11 '14 at 00:13
  • The file does start immediately from a character, but I've tried (manually) putting it into a new file and saving it as a new file, and this fixed the error. Naming it the same thing also solved the problem. Basically whenever the file originated as something that was made in excel, even if data originated elsewhere, it worked. – Ahmed Feb 11 '14 at 00:28

1 Answers1

0

I think there is a Byte Order Mark at the beginning of the file that is not observable but exists. This answer describes how to remove it < converting utf-16 -> utf-8 AND remove BOM>.

Community
  • 1
  • 1
PyNEwbie
  • 4,882
  • 4
  • 38
  • 86
  • I'm not quite sure how to use this code. ``` with open(ff_name, 'rb') as source_file: with open(target_file_name, 'w+b') as dest_file: contents = source_file.read() dest_file.write(contents.decode('utf-16').encode('utf-8')) ``` I replaced ```ff_name``` with the file I'm reading from, and `target_file_name` with the file I'm writing to, and the code executed but I still got an error. Did I use this right? – Ahmed Feb 11 '14 at 01:40
  • It would be better to paste the code into your answer so I can see it. and more importantly paste the error message. – PyNEwbie Feb 11 '14 at 02:47