
I used the Gmail API to fetch the email content, and in Node.js I decode it to a string:

Buffer.from(dataToDecode, 'base64').toString('utf8')

Then I use a regular expression to search for dates in the text, e.g. Feb 27, 2019:

/[A-Z][a-z]{2} [0-9]{2}, [0-9]{4}/g

It finds no match, but when I console.log the content, the date is there. When I copy the date into an online decoding tool, it turns out that

\xe2\x80\x8c\x46\xe2\x80\x8c\x65\xe2\x80\x8c\x62\xe2\x80\x8c\x20\xe2\x80\x8c\x32\xe2\x80\x8c\x37\xe2\x80\x8c\x2c\xe2\x80\x8c\x20\xe2\x80\x8c\x32\xe2\x80\x8c\x30\xe2\x80\x8c\x31\xe2\x80\x8c\x39\xe2\x80\x8c\x0a

and

\x46\x65\x62\x20\x32\x37\x2c\x20\x32\x30\x31\x39

both render as the same 'Feb 27, 2019'. How can I use a regular expression to capture the first encoding (i.e. the longer one)?

William Wu
  • Maybe the accepted answer @ https://stackoverflow.com/questions/24811008/gmail-api-decoding-messages-in-javascript can help you. – Flo Feb 27 '19 at 14:52
  • `\xe2\x80\x8c` is UTF-8 for [U+200C ZERO WIDTH NON-JOINER](https://www.fileformat.info/info/unicode/char/200c/index.htm). You basically just want to discard those. – tripleee Feb 27 '19 at 14:55
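
A minimal sketch of that suggestion, assuming the body has already been decoded to a JavaScript string. The helper name `stripZeroWidth` and the sample string `raw` are made up for illustration; the sample mimics the bytes in the question by placing U+200C before every character.

```javascript
// Hypothetical helper: remove zero-width non-joiners (U+200C) from a string.
function stripZeroWidth(text) {
  return text.replace(/\u200c/g, '');
}

// Illustrative sample: U+200C before every visible character, as in the question.
const raw = Array.from('Feb 27, 2019', (ch) => '\u200c' + ch).join('');

// The pattern from the question, unchanged.
const datePattern = /[A-Z][a-z]{2} [0-9]{2}, [0-9]{4}/g;

console.log(raw.match(datePattern));                 // null
console.log(stripZeroWidth(raw).match(datePattern)); // [ 'Feb 27, 2019' ]
```

Stripping first keeps the date pattern readable. If other invisible characters show up in the mail body, the character class in the helper would probably need to be widened.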

1 Answer


1. Check the Unicode table.

2. Map each character to a regex class:

| UTF-8 | Regex | Description |
|---|---|---|
| `\x20` | `[\s]` | space |
| `\x2C` | `[\,]` | comma |
| `\x30-\x39` | `[0-9]` | digits |
| `\x41-\x5A` | `[A-Z]` | uppercase letters |
| `\x61-\x7A` | `[a-z]` | lowercase letters |

Pattern

String: Feb 27, 2019

Regex: /[A-Z][a-z][a-z]\s\d\d\,\s\d{4}/g

UTF-8: /[\x41-\x5A][\x61-\x7A]{2}\x20[\x30-\x39]+\x2C\x20[\x30-\x39]{4}/g

Regex101 demo
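
Note that in the longer byte sequence from the question, every character is preceded by U+200C, so a pattern applied to the raw text has to allow for it. One possible sketch, assuming the text is an already-decoded JavaScript string and that U+200C is the only character interleaved; the names `ZWNJ`, `parts`, `tolerantDate`, and `sample` are illustrative and not part of the original answer:

```javascript
// Assumption: only U+200C is interleaved between the visible characters.
const ZWNJ = '\\u200c*'; // regex source: zero or more zero-width non-joiners

// One entry per visible character of a date such as "Feb 27, 2019".
const parts = ['[A-Z]', '[a-z]', '[a-z]', ' ',
               '[0-9]', '[0-9]', ',', ' ',
               '[0-9]', '[0-9]', '[0-9]', '[0-9]'];

// Joining with ZWNJ lets the invisible characters sit between any two tokens.
const tolerantDate = new RegExp(parts.join(ZWNJ), 'g');

// Illustrative sample mimicking the question's longer encoding.
const sample = Array.from('Feb 27, 2019', (ch) => '\u200c' + ch).join('');

// The raw matches still contain U+200C, so strip it from each match.
const matches = sample.match(tolerantDate) || [];
const cleaned = matches.map((m) => m.replace(/\u200c/g, ''));
console.log(cleaned); // [ 'Feb 27, 2019' ]
```

This matches both the plain and the interleaved form, at the cost of a less readable pattern.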

Joseph