I have a database dump with approximately 6.0000 lines. They all look like this:

{"student”:”12345”,”achieved_date":1576018800,"expiration_date":1648677600,"course_code”:”SOMECODE,”certificate”:”STRING WITH A LOT OF CHARACTERS”,”certificate_code”:”ABCDE,”certificate_date":1546297200}

"STRING WITH A LOT OF CHARACTERS" is a string with around 600.000 characters (!)

I need those characters removed from each line. I tried:

sed 's/certificate\":\"*","certificate_code//'

But it seems it did not do the trick.

I also couldn't find an existing answer that works for this, so I'm reaching out in the hope that you can help. Is this best done with sed, or with some other method?

For now I don't care whether all the characters in "STRING WITH A LOT OF CHARACTERS" are removed or replaced by, e.g., a 0; either would make the file workable for me ;)

The output for od -xc filename | head is:

0000000    2d2d    4d20    5379    4c51    6420    6d75    2070    3031
          -   -       M   y   S   Q   L       d   u   m   p       1   0
0000020    312e    2033    4420    7369    7274    6269    3520    372e
          .   1   3           D   i   s   t   r   i   b       5   .   7
0000040    322e    2c39    6620    726f    4c20    6e69    7875    2820
          .   2   9   ,       f   o   r       L   i   n   u   x       (
0000060    3878    5f36    3436    0a29    2d2d    2d0a    202d    6f48
          x   8   6   _   6   4   )  \n   -   -  \n   -   -       H   o
0000100    7473    203a    3231    2e37    2e30    2e30    2031    2020
          s   t   :       1   2   7   .   0   .   0   .   1

hope you can help me!

  • `sed 's/\("certificate":"\)[^"]*"/\1"/' file > outputfile`? – Wiktor Stribiżew Mar 10 '20 at 12:02
  • When I hit double-quotes on my keyboard, I get ", which is ASCII code 34. However, the "double-quotes" in your sample text are not this character but various Unicode characters such as https://www.fileformat.info/info/unicode/char/201d/index.htm . If that is what is actually in the file, it would explain why your sed command isn't matching. To confirm, could you edit the question to show the first few lines of output of the command `od -xc filename | head`? – racraman Mar 10 '20 at 12:21
  • Added the output to the main question, thanks – Daniël Dubbeldam Mar 10 '20 at 12:36

2 Answers

When I run the od command on the sample text you supplied, the output includes:

0000520      454d    4f43    4544    e22c    9d80    6563    7472    6669
           M   E   C   O   D   E   ,   ”  **  **   c   e   r   t   i   f
0000540      6369    7461    e265    9d80    e23a    9d80    5453    4952
           i   c   a   t   e   ”  **  **   :   ”  **  **   S   T   R   I
0000560      474e    5720    5449    2048    2041    4f4c    2054    464f
           N   G       W   I   T   H       A       L   O   T       O   F
0000600      4320    4148    4152    5443    5245    e253    9d80    e22c
               C   H   A   R   A   C   T   E   R   S   ”  **  **   ,   ”
0000620      9d80    6563    7472    6669    6369    7461    5f65    6f63
          **  **   c   e   r   t   i   f   i   c   a   t   e   _   c   o
0000640      6564    80e2    3a9d    80e2    419d    4342    4544    e22c
           d   e   ”  **  **   :   ”  **  **   A   B   C   D   E   ,   ”

So you can see that the "quotes" are the byte sequence e2 80 9d, which is the UTF-8 encoding of Unicode U+201D, RIGHT DOUBLE QUOTATION MARK (see https://www.utf8-chartable.de/unicode-utf8-table.pl?start=8192&number=128 ).
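The od -x listing above groups bytes into 16-bit words, so on a little-endian machine each pair appears swapped (e22c 9d80 stands for the bytes 2c e2 80 9d). As a minimal sketch for checking the sequence in file order, the following prints one byte per column; the variable names q and bytes are only illustrative:

```shell
# U+201D built from its three UTF-8 bytes (octal escapes are portable in printf).
q=$(printf '\342\200\235')

# -An drops the offset column, -tx1 prints single bytes in file order.
# tr removes the padding so only the hex digits remain.
bytes=$(printf '%s' "$q" | od -An -tx1 | tr -d ' \n')

echo "$bytes"
```

Running the same `od -An -tx1` on a slice of the real dump should show e2 80 9d wherever one of these quotes occurs.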

Probably the simplest approach is to skip these Unicode characters with the single-character wildcard (.):

sed "s/certificate.:.*.certificate_code/certificate_code/"
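One caveat with the wildcard: a single . only covers the whole quote when sed runs in a UTF-8 locale. In a byte-oriented locale such as C, each quote is three separate bytes and the pattern silently fails to match. The sketch below illustrates this on a shortened, hypothetical sample line (the value LONG VALUE and the variable names are assumptions for illustration) and shows a byte-level variant that uses three dots per quote:

```shell
# U+201D as raw UTF-8 bytes.
q=$(printf '\342\200\235')

# Shortened stand-in for the relevant part of one dump line (hypothetical data).
line="${q}certificate${q}:${q}LONG VALUE${q},${q}certificate_code${q}:${q}ABCDE"

# C locale: "." matches exactly one byte, so it cannot span the 3-byte quote.
# The substitution does not match and the line comes back unchanged.
one_dot=$(printf '%s\n' "$line" |
  LC_ALL=C sed 's/certificate.:.*.certificate_code/certificate_code/')

# C locale with three dots: they match the bytes e2 80 9d one by one.
three_dots=$(printf '%s\n' "$line" |
  LC_ALL=C sed 's/certificate...:.*certificate_code/certificate_code/')

printf '%s\n' "$one_dot" "$three_dots"
```

If the original command appears to do nothing, checking the output of `locale` is a reasonable first step.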

Unfortunately, sed doesn't appear to accept the Unicode \u201d syntax, so some other answers suggest using the hex byte sequence (\xe2\x80\x9d) instead, e.g. Escaping double quotation marks in sed. I haven't got that to work yet, though, and I have to sign off now.

This answer explains why it could have happened, with some remedial action if that's possible in your situation: Unknown UTF-8 code units closing double quotes

racraman

If you are working with bash, would you please try the following:

q=$'\xe2\x80\x9d'
sed "s/certificate${q}:${q}.*${q},${q}certificate_code//" file

Result:

{"student”:”12345”,”achieved_date":1576018800,"expiration_date":1648677600,"course_code”:”SOMECODE,””:”ABCDE,”certificate_date":1546297200}
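The command above also removes the key names, which leaves ”” directly in front of :”ABCDE. If the certificate field should stay in place with an empty value, one possible variant keeps the surrounding text through capture groups. This is a sketch under assumptions: the sample line is a shortened, hypothetical stand-in for a dump line, and LC_ALL=C is added so that sed works on plain bytes regardless of the locale.

```shell
# U+201D as raw UTF-8 bytes (same character as $'\xe2\x80\x9d' in bash).
q=$(printf '\342\200\235')

# Shortened stand-in for one dump line, with the same quote layout as the question.
line="{\"course_code${q}:${q}SOMECODE,${q}certificate${q}:${q}STRING WITH A LOT OF CHARACTERS${q},${q}certificate_code${q}:${q}ABCDE,${q}certificate_date\":1546297200}"

# \1 keeps the key and its opening quote, \2 keeps the closing quote and the next key.
# Only the long value between the two groups is dropped.
cleaned=$(printf '%s\n' "$line" |
  LC_ALL=C sed "s/\(${q}certificate${q}:${q}\).*\(${q},${q}certificate_code\)/\1\2/")

printf '%s\n' "$cleaned"
```

On the real file the same sed expression would be run as `LC_ALL=C sed "..." file > outputfile`, with outputfile being whatever name suits you.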
tshiono