0

I'm trying to replace newline characters in a JSON string, because Postgres' COPY function won't work otherwise.

I've been trying different combinations of sed, such as:

gsed -z 's/\\n/\\\\n/g'
gsed -E ':a;N;$!ba;s/\r{0,1}\n/\\n/g'
gsed '/}$/{N; s/}\n{/},\n{/}

But it either replaces all of them or none.

Here's my sample data:

{"corpusid":75399585,"externalids":{"ACL":null,"DBLP":null,"ArXiv":null,"MAG":"2350229292","CorpusId":"75399585","PubMed":null,"DOI":null,"PubMedCentral":null},"url":"https://www.semanticscholar.org/paper/aec9799d6e12f2000deb8f7dadcc7d7e653f7393","title":"Prevention and Therapy of Osteoporosis","authors":[{"authorId":"40937034","name":"Hu Jun-bo"}],"venue":"","publicationvenueid":null,"year":2003,"referencecount":0,"citationcount":0,"influentialcitationcount":0,"isopenaccess":false,"s2fieldsofstudy":[{"category":"Medicine","source":"s2-fos-model"},{"category":"Medicine","source":"external"}],"publicationtypes":null,"publicationdate":null,"journal":null,"updated":"2022-02-11T07:19:53.877Z"}
{"corpusid":83138554,"openaccessinfo":{"externalids":{"MAG":"883546316","ACL":null,"DOI":null,"PubMedCentral":null,"ArXiv":null},"license":null,"url":null,"status":null},"abstract":"Phytosulfokin-alpha (PSK-alpha) ist ein bisulfatiertes Pentapetid, welches als Wachstumsfaktor die Proliferation von Zellen in Zellkultur niedriger Dichte stimuliert. In der vorliegenden Arbeit wurden physiologische und genetische Ansatze verfolgt, um Funktionen von PSK-alpha in Arabidopsis thaliana aufzuklaren. Promotor-GUS Expressionsanalysen der PSK-Praproproteingene und des PSK-Rezeptorgens AtPSKR1 in Wurzeln und Bluten deuteten auf mogliche Funktionen von PSK-alpha in diesen Geweben hin. \nExogen appliziertes PSK-alpha forderte das Wurzelwachstum von Arabidopsiskeimlingen. Das Wurzelwachstum der PSK-Rezeptor T-DNA Insertionsmutante Atpskr1-T war inhibiert. Diese Ergebnisse zeigten, dass PSK-alpha in die Regulation von Wurzelwachstum involviert ist. \nCharakteristisch fur die Atpskr1-T Mutante war eine verringerte Samenbildung im Vergleich zu Wildtyppflanzen. Genetische Analysen zeigten, dass sowohl ein paternaler als auch maternale Effekte und asynchrones Wachstum der inneren Blutenorgane zur Ausbildung des Mutantenphanotyps fuhren.","updated":"2022-02-11T01:27:48.987Z"}

As you can see, second JSON string has \n within the "abstract" field, it needs to be replaced to \\n I assume for the COPY to work.

But I have to retain newline in between JSON strings, as there's thousands of them, so that COPY reads that file properly.

deb0rian
  • 966
  • 1
  • 13
  • 37
  • It's a list of JSON strings as mentioned previously. sed is working line-by-line, so it should do the trick with replacing, I just need the right... regex. Other than that that file structure works very well with Postgres' COPY – deb0rian Jan 16 '23 at 21:33
  • 1
    Using a regex to work with JSON is wrong in many of the same ways as [using a regex to parse HTML](https://stackoverflow.com/a/1732454/14122). Regular expressions are only able to parse regular languages; see also [Is JSON a regular language?](https://cstheory.stackexchange.com/questions/3987/is-json-a-regular-language) on [cstheory.se]. (Do they no longer teach the [Chomsky hierarchy](https://en.wikipedia.org/wiki/Chomsky_hierarchy) to comp sci undergrads?) – Charles Duffy Jan 16 '23 at 22:18

2 Answers2

1

A workaround without sed:

head -n 1 file > new_file
tail -n 1 file | jq -c '.abstract |= (gsub("\n"; "\\n"))' >> new_file
Cyrus
  • 84,225
  • 14
  • 89
  • 153
  • That would be very limited to the specific sample data. There's actually millions of such JSON strings, some have abstract field, some not, some have the \n character in another JSON field. – deb0rian Jan 16 '23 at 22:11
  • @demorph, jq has generic "recurse through this structure and map every string through this function" primitives. It's very much the right tool for the job at hand, even if the question wasn't asked expansively enough to describe your real requirements. – Charles Duffy Jan 16 '23 at 22:13
  • @demorph, ...in particular, try combining this answer with the answer to [jq: replace string in full JSON](https://stackoverflow.com/questions/48254985/jq-replace-string-in-full-json). – Charles Duffy Jan 16 '23 at 22:15
  • (That said, I don't know what the point of the `head` and `tail` usage is here; if the goal is to update _all_ lines, one should just pipe the whole file through). – Charles Duffy Jan 16 '23 at 22:22
0

I have reused the portion with jq by Cyrus and wrapped it into a sh script, as I had many files I needed to convert. Please note it also replaces \" with " as it was preventing COPY function from properly importing JSON strings.

Here's the result:

#!/bin/bash

input_file=$1
output_file=$2

while read -r line; do
  if echo "$line" | jq -e 'has("abstract")' > /dev/null; then
    echo "$line" | jq -c 'try .abstract |= (gsub("\\n"; "\\n") | (gsub("\""; """)))' | tee -a $output_file > /dev/null
  else
    echo "$line" | tee -a $output_file > /dev/null
  fi
done < $input_file

tail -n 1 $input_file | jq -c 'try .abstract |= (gsub("\\n"; "\\n") | (gsub("\""; "&quot;")))' >> $output_file

Usage:

./script_name.sh original_file.json converted_file.json

deb0rian
  • 966
  • 1
  • 13
  • 37