I'm reading data from a file, and sometimes the file contains funky stuff, like:

"䉌Āᜊ»ç‰ç•‡ï¼ƒè¸²æœ€ä²’Bíœë¨¿ä„€å•²ï²ä‹¾é¥˜BéŒé“‡ä„€â²ä‹¾â¢"

I need to strip/replace these characters as JSON has no idea what to do with them.

They aren't control characters (I think), so my current regex of

Regex.Replace(value, @"\p{C}+", string.Empty);

isn't catching them.

Many of the strings I read in are going to be long, upwards of 256 characters, so I'd rather not loop through each char to check it.

Is there a simple solution to this? I'm thinking regular expressions would solve it, but I'm not sure.

Tony
  • It looks like you're reading the file using the wrong encoding. –  Nov 13 '18 at 16:24
  • This is what I thought, but 99% of the strings read are valid, these happen maybe one in every 1000 reads. It's more than likely data corruption, as these are the remnants of log files. – Tony Nov 13 '18 at 16:25
  • If you look, that's the answer that I found the regex I'm using in :) – Tony Nov 13 '18 at 16:27
  • When you encounter garbage data like this, is it the entire file, or just parts of a file? Also, I think your title is inaccurate, as these are printable characters. –  Nov 13 '18 at 16:39
  • Please define the chars you want to remove. It is also a good idea to paste the text as text to the question itself. If you want to only keep printable ASCII, try `Regex.Replace(s, "[^ -~]+", "")` – Wiktor Stribiżew Nov 13 '18 at 16:41
  • It's just parts of the file, these are remnants of old, circular log files. Good point, I'll update the title. – Tony Nov 13 '18 at 16:41
  • See https://regex101.com/r/E5h1LS/1 – Wiktor Stribiżew Nov 13 '18 at 16:52
  • Thanks Wiktor. I'll give that a go. – Tony Nov 13 '18 at 16:58
  • It doesn't seem like a good plan to turn someone's text data into garbage and then throw some of the garbage out in hopes of retaining some of the original data. Can you show what might be the problem upstream? – Tom Blodget Nov 13 '18 at 23:41

1 Answer

If all you want is ASCII then you could do:

Regex.Replace(value, @"[^\x00-\x7F]+", string.Empty);

and if all you want are the "normal" ASCII characters, you could do:

Regex.Replace(value, @"[^\x20-\x7E]+", string.Empty);
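
For reference, here is a minimal sketch of how the two patterns above might be wrapped for repeated use on many strings. The class name `AsciiFilter`, the method names, and the sample string are hypothetical and chosen only for illustration; caching the `Regex` instances with `RegexOptions.Compiled` is an assumption about what suits a high-volume log reader, not something the question requires.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class AsciiFilter
{
    // Runs of anything outside 7-bit ASCII (U+0000 to U+007F).
    private static readonly Regex NonAscii =
        new Regex(@"[^\x00-\x7F]+", RegexOptions.Compiled);

    // Runs of anything outside printable ASCII (space through tilde).
    private static readonly Regex NonPrintableAscii =
        new Regex(@"[^\x20-\x7E]+", RegexOptions.Compiled);

    public static string KeepAscii(string value) =>
        NonAscii.Replace(value, string.Empty);

    public static string KeepPrintableAscii(string value) =>
        NonPrintableAscii.Replace(value, string.Empty);

    public static void Main()
    {
        // Made-up sample: valid text, a tab, then a few stray characters.
        string sample = "log entry 42\t䉌Āᜊ» done";

        // The tab is ASCII, so the first pattern keeps it.
        Console.WriteLine(KeepAscii(sample));          // "log entry 42<TAB> done"

        // The tab is not printable, so the second pattern drops it too.
        Console.WriteLine(KeepPrintableAscii(sample)); // "log entry 42 done"
    }
}
```

Note that either pattern also removes legitimate non-English text (accented letters, CJK, and so on), so this only fits if the valid log content is known to be plain ASCII.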
Woody1193