I am trying to process a very large XML file. At some point it seems to contain a weird character that makes the processing script fail.

I'd like to see what is on that line, but Python (3.6.9) reports a negative line number:

xml.parsers.expat.ExpatError: not well-formed (invalid token): line -1503625011, column 60

I assume that the line number is negative because it is above the max integer value.

How can I "convert" this negative number to a positive one, so I can feed it to `head -n (number) file | tail -n1` in order to isolate the faulty line?

Tomerikoo
motagirl2
  • Why not just `print(line)` and then the last printed line before the error is your culprit? – Tomerikoo Feb 24 '21 at 09:18
  • @Tomerikoo There are a lot of lines to print, which would make the script take longer. Also, the line might not be suitable for printing, the faulty characters might be non-printable, etc. – motagirl2 Feb 24 '21 at 09:20
  • Add a `try/except` and print the line and line number there? – Tomerikoo Feb 24 '21 at 09:23
  • Have you double checked it's not an encoding or BOM (byte order mark) problem? possibly related: https://stackoverflow.com/questions/48821725/xml-parsers-expat-expaterror-not-well-formed-invalid-token – Shameen Feb 24 '21 at 09:30
  • @Shameen No, I've been able to extract a few thousand of millions of lines before the error. – motagirl2 Feb 24 '21 at 09:35
  • OK. Random guess, but if I convert `-1503625011` into an unsigned 32-bit int I get `2791342285`. Is that within your data? – Shameen Feb 24 '21 at 09:44
  • yes! @Shameen Thanks, there is a nasty non printable character there ('\f', or hex 0C) – motagirl2 Feb 24 '21 at 10:34
  • @Shameen Would you post that as an answer, so I can accept and close the question? – motagirl2 Feb 24 '21 at 10:38
  • Also see: https://stackoverflow.com/questions/730133/what-are-invalid-characters-in-xml/ – stop.climatechange.now Feb 24 '21 at 10:44

1 Answer

Looks like it's incorrectly using a signed 32-bit int. Converting -1503625011 to an unsigned 32-bit int gives 2791342285.

To 'un-sign' integers like this, see How to convert signed to unsigned integer in python
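
As a minimal sketch of that conversion, assuming the reported value is a line counter that wrapped exactly once in a signed 32-bit int, masking with `0xFFFFFFFF` reinterprets it as unsigned. The helper name `to_unsigned_32` is made up for illustration:

```python
def to_unsigned_32(n: int) -> int:
    """Reinterpret a signed 32-bit value as unsigned (two's complement)."""
    # Masking keeps the low 32 bits; for negative n this equals n + 2**32.
    return n & 0xFFFFFFFF


reported_line = -1503625011  # line number taken from the ExpatError message
print(to_unsigned_32(reported_line))  # 2791342285
```

If the file had more than 2^32 lines, the counter could have wrapped more than once, in which case the real line number would be this value plus a multiple of 2^32.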

Note: This would only affect line numbers >= 2^31 (2,147,483,648), i.e. beyond the signed 32-bit maximum of 2,147,483,647.
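
Once the positive line number is known, one possible way to inspect that line from Python instead of `head`/`tail` is sketched below. It assumes the parser's line numbers are 1-based and that lines are separated by `\n`. The function name `read_line` and the file name in the usage comment are hypothetical:

```python
from itertools import islice


def read_line(path, line_number):
    """Return the 1-based line `line_number` of `path` as bytes, or None if the file is shorter."""
    # Binary mode avoids decoding errors and keeps non-printable bytes intact.
    with open(path, "rb") as f:
        return next(islice(f, line_number - 1, line_number), None)


# Hypothetical usage; repr() makes characters such as '\x0c' visible:
# print(repr(read_line("huge.xml", 2791342285)))
```

This still scans the file from the start, so it is not faster than `head | tail`, but printing the `repr` of the raw bytes makes non-printable characters easy to spot.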

Shameen