
I have thousands of JSON files created by an old PHP application that will be imported into a new version being developed in Rails.

In PHP I'm running `json_encode($object)` to encode each item before saving it.

Here is an edited-down version of the JSON that is being produced. The description field is where I'm seeing the Unicode characters.

{ "ID": "", "parentID": "", "formID": "", "defaultProject": "", "data": { "title": "", "idno": "", "date": "2016-11-09", "creator": [ ], "contributor": [ ], "itemNumber": "", "oclcNumber": "", "publisher": "", "publisherLocation": "", "description": "<..contents removed> family\u00e2\u0080\u0099s land <..contents removed> \r\n", "subject": [ ], "type": "", "provenanceDpla": "", "rights": "", "location": [ ], "timePeriod": "", "format": [ ], "language": [ ], "source": "", "extent": "" } }, "metadata": "", "idno": "", "modifiedTime": "", "createTime": "", "modifiedBy": "", "createdBy": "", "publicRelease": "" }

The part we are having issues with is in the description field. The original looks like this:

enter image description here

When I view the imported record, this part looks like this:

garbage characters

Inspecting the item in the Rails console, it looks like this:

enter image description here

I'm using `@hash = JSON.parse(File.read(file))`. Does anyone have a good recommendation on how to handle this? I'm sure we will find more of these as we work on exporting the content.
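To see exactly what the parser produced, one option is to list the non-ASCII code points of the parsed string. This is a minimal sketch that uses the `description` fragment from the JSON above as inline sample data instead of reading a file; the variable names are illustrative only.

```ruby
require 'json'

# Sample data: the description fragment from the exported JSON above.
# Single quotes keep the \u escapes literal so JSON.parse decodes them.
raw = '{"description": "family\u00e2\u0080\u0099s land"}'
description = JSON.parse(raw)['description']

# Collect every non-ASCII code point so unexpected characters stand out.
suspicious = description.codepoints
                        .reject { |cp| cp < 0x80 }
                        .map { |cp| format('U+%04X', cp) }

p suspicious
```

Three code points in the U+0080 to U+00FF range where a single punctuation mark is expected suggest the text went through an extra encoding step before it was written.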

Tracy McCormick
  • What is your definition of "handle it"? To convert the files to UTF-8? – max May 03 '22 at 17:45
  • These look like noise from legacy systems, as some of these characters appear in the [UTF-8 character table](https://www.utf8-chartable.de/), and you may be able to clean them. For more context, could you share a snippet of these JSON files? Also, which encoding are these files using? – daniloisr May 03 '22 at 17:49
  • `0xE28099` is the UTF-8 representation of the right single quote https://www.fileformat.info/info/unicode/char/2019/index.htm – Chris Haas May 03 '22 at 17:50
  • @Tracy, I'm not quite sure what the problem is. Is it that you are getting Unicode escape sequences, and your Ruby JSON parsing logic isn't unescaping them for you? – Chris Haas May 03 '22 at 17:55
  • I updated my question with additional information and what I'm seeing in the item and what I see in the view. – Tracy McCormick May 03 '22 at 19:50
  • @TracyMcCormick try encoding the json in php using `json_encode($object, JSON_UNESCAPED_UNICODE)` [ref: https://stackoverflow.com/a/13478887/1042324] – daniloisr May 04 '22 at 13:22
  • @daniloisr I just tried that and re-imported; that didn't change anything. – Tracy McCormick May 04 '22 at 16:12
  • @TracyMcCormick I see, let's try finding out which encoding PHP is using so we can use Ruby's `String#encode` to fix it. Try https://www.php.net/manual/en/function.mb-detect-encoding.php in PHP to see what it returns for `$object->description` – daniloisr May 04 '22 at 17:55
  • @TracyMcCormick also, try checking the encoding of the .json file with https://www.freedesktop.org/wiki/Software/uchardet. The problem seems to be that Ruby is trying to read the file as UTF-8 but it isn't, because `\u00e2\u0080\u0099` should be just `\u2019` in UTF-8. I tried many different ways of encoding `\u00e2\u0080\u0099` as `\u2019` in Ruby, but no success so far – daniloisr May 04 '22 at 19:43
  • In the PHP export I was trying to prepare the data by doing `$utf_encoded = mb_convert_encoding( $item, 'UTF-8' );`. After removing this, it now gives me the code `\u2019` instead. It imported correctly. – Tracy McCormick May 05 '22 at 18:02

1 Answer

In the PHP export I was running `$utf_encoded = mb_convert_encoding( $item, 'UTF-8' );` to ensure that everything was encoded as UTF-8, but for some reason this was producing the result above. Removing that call gave me `\u2019` instead of `\u00e2\u0080\u0099`, and the import into the new Rails app then worked correctly.
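For files that were already exported and cannot easily be regenerated, a repair on the Ruby side may also be possible. The sketch below assumes the text was UTF-8 whose bytes were reinterpreted as ISO-8859-1 before being written, which is consistent with `\u00e2\u0080\u0099` appearing in place of `\u2019`. The helper name `repair_double_encoding` is hypothetical and not part of any library.

```ruby
require 'json'

# Hypothetical helper: undo one round of "UTF-8 bytes read as ISO-8859-1".
# Assumption: the damage is exactly one extra Latin-1 -> UTF-8 conversion.
def repair_double_encoding(text)
  # Map each character back to a single byte, then reread the bytes as UTF-8.
  repaired = text.encode('ISO-8859-1').force_encoding('UTF-8')
  # If the bytes are not valid UTF-8, the text was probably fine; keep it.
  repaired.valid_encoding? ? repaired : text
rescue Encoding::UndefinedConversionError
  # Characters outside ISO-8859-1 (such as U+2019) mean there is nothing to undo.
  text
end

# Sample data: the description fragment from the question.
raw = '{"description": "family\u00e2\u0080\u0099s land"}'
parsed = JSON.parse(raw)
fixed = repair_double_encoding(parsed['description'])

puts fixed == "family\u2019s land"
```

Strings that are already correct are returned unchanged in the two common cases handled above, but text that legitimately contains sequences such as `Ã©` would be altered, so it is worth spot-checking results before running this over every file.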

Tracy McCormick