I have a bunch of XML files that have been corrupted. Within them is still some uncorrupted data. Here is a picture of what I am talking about:

Screenshot

I want to iterate through each of the files with Python and grab the uncorrupted data, but when I open a file with Python:

i2 = open(x + '/' + i, 'r')

It opens the file, but when I try to read it, this is all that comes back:

'\xa8\x9f\xb0\xdb\x17\xa1\t&}4U\xccsr\xcfN\x7fS\xa1C\xb5\xa4\xd6a\x84i'

I've tried a few different encodings, but it keeps coming back with an error:

i2 = open(x + '/' + i, 'r', encoding='utf8')
i2 = open(x + '/' + i, 'r', encoding='ANSI')

Please let me know if you know why Python is not reading this file correctly.

Cody Brown
  • How is Python supposed to know which part of the file is corrupted and which is not? It just sees random bytes. – poke Jul 06 '15 at 19:31
  • You might try the `strings(1)` program. If you *do* use Python, open the file in `rb` mode and work with `bytes` instead of using an encoding. – o11c Jul 06 '15 at 19:32
  • In the middle of my screenshot there are about 10 lines that have dates and log specific bits of information. I know the key that I want Python to find in each line. The key above would be 02/27/2009. If that's in the line, then I want to save it. – Cody Brown Jul 06 '15 at 19:32
  • @o11c frig that got it. Man I am stupid sometimes, add it as an answer. – Cody Brown Jul 06 '15 at 19:34

1 Answer

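You won't be able to read that file in text mode. Most likely the corrupted data contains a byte that text mode treats as an end-of-file marker, so Python stops reading early. (On Windows with Python 2, for example, a Ctrl-Z byte, 0x1A, has this effect.)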

Try opening it in binary mode (mode='rb') and avoid read functions that assume the content is text, such as readline(). There's already a Stack Overflow question covering reading binary files:

Reading a binary file with python

You'll have to extract the "uncorrupted" parts by checking the binary values byte-by-byte and saving the contiguous valid bytes (ASCII, or UTF-8 I assume) into strings to then print out or write to another file.
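
As a rough sketch of that idea (not a definitive implementation), the snippet below reads the raw bytes and uses a regular expression to pull out runs of printable ASCII, much like `strings(1)` does, then keeps only the runs containing a key such as the 02/27/2009 date mentioned in the comments. It assumes Python 3.5 or later and that the surviving data is plain ASCII. The function names, the `min_len` threshold, and the sample bytes are all made up for illustration.

```python
import re


def extract_text_runs(data, min_len=4):
    """Return runs of at least min_len printable ASCII bytes found in data.

    Printable here means tab plus the range space (0x20) through tilde (0x7e).
    CR and LF are excluded, so a run never spans more than one line.
    """
    # Assumption: the uncorrupted parts are plain ASCII text.
    pattern = re.compile(rb"[\t\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]


def lines_with_key(path, key, min_len=4):
    """Read a file in binary mode and return the text runs that contain key."""
    with open(path, "rb") as f:  # binary mode: no decoding, no early EOF
        data = f.read()
    return [run for run in extract_text_runs(data, min_len) if key in run]


if __name__ == "__main__":
    # Hypothetical sample: two readable log lines surrounded by junk bytes.
    sample = (
        b"\xa8\x9f\xb0\x00\x1a02/27/2009 10:15 job started\r\n"
        b"\x17\xa1ab\x0002/27/2009 10:16 job done\n\xff\xfe"
    )
    for run in extract_text_runs(sample):
        print(run)
```

Short runs of junk that happen to be printable will still slip through, which is why filtering on a known key (or raising `min_len`) helps. If the intact text is UTF-8 rather than ASCII, the pattern and the decode step would need to be adjusted.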

Corbell