I need to find and replace all unidentified characters in an XML file using Notepad++. I don't know the technical term for them (they probably can't even be called characters), so I'm attaching an example image:

example case

The stuff between "string" and "/string" is what I need to find. They can't be copied like text because they're not actually text, and if I try to paste them here it looks like this:

So how do I find all of them (excluding newlines) and clear them (replace with nothing) from the file using a regex?

Edit: Encoding >> Convert to UTF-8 does not clear them.

Edit: I uploaded a sample file to better illustrate the situation: https://file.io/QsyodE. I need to weed out the unidentified stuff, like the items in the "Genre" strings that come before the kanji(?) characters. You can't see them if you open the file with a plain text viewer (like Notepad) because they are not actually text, but you do see them when you open it with Notepad++. That is also why I need to remove them: the fact that they are not text makes the original, humongous XML file unimportable by iTunes.

Can Celik
  • Can you examine them at http://r12a.github.io/apps/conversion/? – Wiktor Stribiżew May 24 '16 at 13:18
  • How do you mean? Like I said, they can't be copied. – Can Celik May 24 '16 at 13:25
  • Does this help: http://stackoverflow.com/questions/20889996/notepad-how-to-remove-all-non-ascii-characters-with-regex – Neal May 24 '16 at 14:11
  • No, because using [^\x00-\x7F]+ also finds non-ASCII characters like é or ü, which I am fine with (and don't want to replace). – Can Celik May 24 '16 at 14:15
  • Does changing the encoding on notepad++ to UTF-8 BOM make them workable? – Neal May 24 '16 at 14:18
  • No, that doesn't help either. – Can Celik May 24 '16 at 14:19
  • Do you know where it starts? What's the lowest value after the x? – Thomas Ayoub May 24 '16 at 14:39
  • To answer this I need to find them all first. Anyway, I don't think that is relevant; these are what you see in a non-text file, like what you see when you open an exe file with Notepad++. – Can Celik May 24 '16 at 14:55
  • Then play with `MM` in `[^\xMM-\x7F]+` until you find the correct value. Without more information we can't help you... – Thomas Ayoub May 24 '16 at 15:02
  • Do you realize this is like telling a disabled person "I can't help you because you can't walk"? – Can Celik May 24 '16 at 15:56
  • OP: Since accented letters are fine with you, what you are asking is not easy: you will probably have to list every odd character you DON'T want in the regex and replace it with a space. But as Thomas Ayoub implied, there is likely more than one range of characters that you can change to a space; you just have to look at an ASCII code chart or play around with the character values. [Here's an asciitable](http://www.asciitable.com/) to help you figure out which characters you want removed. – Bulrush May 24 '16 at 16:59
  • Are we sure they're characters? They can't be copied, even into the search tool of Notepad++. By the way, I tried changing [^\xMM-\x7F]+ to [^\xEF-\x7F]+, for example, but Notepad++ says it's an invalid expression. – Can Celik May 24 '16 at 18:20
  • @CanCelik That's because \xEF-\x7F is an inverted range (EF is larger than 7F). – JonM May 25 '16 at 08:36
  • Possible duplicate of [Notepad++, How to remove all non ascii characters with regex?](http://stackoverflow.com/questions/20889996/notepad-how-to-remove-all-non-ascii-characters-with-regex) – John Jones Feb 01 '17 at 19:27
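
The two points raised in the comments (the inverted range, and the need to know which values are actually present before choosing a range) can be illustrated with a minimal sketch. It uses Python rather than Notepad++, on the assumption that both regex engines reject a range whose start is larger than its end; `range_is_valid` and `summarize_non_ascii` are hypothetical helper names, and the sample string is made up for illustration.

```python
import re
import unicodedata
from collections import Counter


def range_is_valid(pattern: str) -> bool:
    """Return True if the pattern compiles, False if the engine rejects it."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


def summarize_non_ascii(text: str) -> Counter:
    """Count every non-ASCII code point, keyed as 'U+XXXX <category>'."""
    counts = Counter()
    for ch in text:
        if ord(ch) > 0x7F:
            counts[f"U+{ord(ch):04X} {unicodedata.category(ch)}"] += 1
    return counts


# EF is larger than 7F, so the first range is inverted and is rejected.
print(range_is_valid(r"[^\xEF-\x7F]+"))  # False
print(range_is_valid(r"[^\x00-\x7F]+"))  # True

# Made-up sample: an accented letter, a noncharacter, and a kanji.
sample = "caf\u00e9 \ufffe\u65e5"
print(summarize_non_ascii(sample))
```

Letters such as é or a kanji report a letter category (`Ll`, `Lo`), while unassigned code points and noncharacters report `Cn`, which may help narrow down what to remove.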

1 Answer

The following will not find é or ü, but it does find xEF xBF xBE:

\b[xX][0-9a-fA-F]+\b
tanuki505
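
Note that the pattern above matches the literal text `xEF`, `xBF`, `xBE`. If what Notepad++ shows are instead raw bytes EF BF BE (an assumption, since the sample file is not reproduced here), those bytes decode in UTF-8 to U+FFFE, a code point that the XML 1.0 `Char` production does not allow. Under that assumption, a minimal sketch outside Notepad++ is to delete everything outside the XML 1.0 `Char` ranges, which keeps accented letters and kanji. `strip_xml_illegal` is a hypothetical helper name and the sample bytes are made up.

```python
import re

# Complement of the XML 1.0 "Char" production: anything other than
# tab, LF, CR, U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF.
XML_ILLEGAL = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_xml_illegal(text: str) -> str:
    """Remove code points that XML 1.0 does not allow; keep newlines."""
    return XML_ILLEGAL.sub("", text)


# Made-up sample: bytes EF BF BE (U+FFFE) before a kanji and "café".
raw = b"<string>\xef\xbf\xbe\xe6\x97\xa5 caf\xc3\xa9</string>\n"
cleaned = strip_xml_illegal(raw.decode("utf-8"))
print(cleaned)
```

For a real file, the same function could be applied to the text read with `open(path, encoding="utf-8")` and the result written to a new file, leaving the original untouched.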