I would like to use IMDb's open data, but it is served in TSV format, which is not very convenient:
https://datasets.imdbws.com/title.crew.tsv.gz
tconst directors writers
tt0000238 nm0349785 \N
tt0000239 nm0349785 \N
tt0000240 \N \N
tt0000241 nm0349785 \N
tt0000242 nm0617588 nm0617588
tt0000243 nm0349785 \N
tt0000244 nm0349785 \N
tt0000245 \N \N
tt0000246 nm0617588 \N
tt0000247 nm0002504,nm0005690,nm2156608 nm0000636,nm0002504
tt0000248 nm0808310 \N
tt0000249 nm0808310 \N
tt0000250 nm0005717 \N
tt0000251 nm0177862 \N
I want to convert the TSV data to JSON, like this:
[
{
"tconst": "tt0000247",
"directors": [
"nm0005690",
"nm0002504",
"nm2156608"
],
"writers": [
"nm0000636",
"nm0002504"
]
},
{
"tconst": "tt0000248",
"directors": [
"nm0808310"
],
"writers": [
"\\N"
]
}
]
I can do this with the command:
jq -rRs 'split("\n")[1:-1] |
map([split("\t")[]|split(",")] | {
"tconst":.[0][0],
"directors":.[1],
"writers":.[2]
}
)' ./title.crew.tsv > ./title.crew.json
However, the file is very large, and I get out-of-memory errors.
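As background on the memory problem: the `-s` (slurp) flag reads the entire file into one string before the filter runs, which is what exhausts memory. Without `-s`, jq processes one input line at a time, so memory stays flat, at the cost of emitting one JSON object per line (NDJSON) instead of a single array. A sketch of that variant, demonstrated on a hypothetical `sample.tsv` standing in for `./title.crew.tsv`:

```shell
# Create a tiny sample file (stand-in for ./title.crew.tsv).
printf 'tconst\tdirectors\twriters\n' > sample.tsv
printf 'tt0000247\tnm0002504,nm0005690\tnm0000636,nm0002504\n' >> sample.tsv

# tail drops the header; jq -R reads each line as a raw string and
# emits one compact JSON object per record (NDJSON, not an array).
tail -n +2 sample.tsv | jq -cR 'split("\t") | map(split(",")) | {
  "tconst": .[0][0],
  "directors": .[1],
  "writers": .[2]
}' > sample.ndjson
```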
1. How can I split this TSV file into several JSON files, each with 1000 records?
./title.crew.page1.json
./title.crew.page2.json
./title.crew.page3.json
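One way this splitting could work (a sketch, assuming GNU coreutils `split` and jq are available; the `sample.*` file names are hypothetical stand-ins for `./title.crew.*`): drop the header line, cut the stream into 1000-line chunks with `split`, then run the same jq filter over each chunk.

```shell
# Create a tiny sample file (stand-in for ./title.crew.tsv).
printf 'tconst\tdirectors\twriters\n' > sample.tsv
printf 'tt0000247\tnm0002504,nm0005690\tnm0000636,nm0002504\n' >> sample.tsv
printf 'tt0000248\tnm0808310\t\\N\n' >> sample.tsv

# Drop the header and cut the stream into 1000-line chunks
# (GNU split; -d gives numeric suffixes: sample.chunk.00, ...).
tail -n +2 sample.tsv | split -l 1000 -d - sample.chunk.

# Convert each chunk with the original jq filter. Chunks have no
# header, so select(length > 0) replaces the [1:-1] slice.
n=1
for f in sample.chunk.*; do
  jq -rRs 'split("\n") | map(select(length > 0)) |
    map([split("\t")[] | split(",")] | {
      "tconst": .[0][0],
      "directors": .[1],
      "writers": .[2]
    })' "$f" > "./sample.page${n}.json"
  rm "$f"
  n=$((n + 1))
done
```

Since each jq invocation slurps only a 1000-line chunk, this also sidesteps the out-of-memory error.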
2. How can I exclude empty fields, so that an empty array is produced instead?
"writers": [ "\\N" ]
-> "writers": [ ]
UPD: the second question has been solved:
jq -rRs 'split("\n")[1:-1] |
map([split("\t")[]|split(",")] |
.[2] |= if .[0] == "\\N" then [] else . end | {
"tconst":.[0][0],
"directors":.[1],
"writers":.[2]
}
)' ./title.crew.tsv > ./title.crew.json
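For completeness, the same `\N` normalization can be applied to both fields, not just writers (a sketch; demonstrated on a hypothetical two-row `sample.tsv` standing in for `./title.crew.tsv`):

```shell
# Create a tiny sample file (stand-in for ./title.crew.tsv);
# the first record has \N in both directors and writers.
printf 'tconst\tdirectors\twriters\n' > sample.tsv
printf 'tt0000240\t\\N\t\\N\n' >> sample.tsv
printf 'tt0000247\tnm0002504,nm0005690\tnm0000636\n' >> sample.tsv

# Normalize \N to [] in both directors and writers.
jq -rRs 'split("\n")[1:-1] |
  map([split("\t")[] | split(",")] | {
    "tconst": .[0][0],
    "directors": (.[1] | if .[0] == "\\N" then [] else . end),
    "writers":   (.[2] | if .[0] == "\\N" then [] else . end)
  })' sample.tsv > sample.json
```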
[
{
"tconst": "tt0000247",
"directors": [
"nm0005690",
"nm0002504",
"nm2156608"
],
"writers": [
"nm0000636",
"nm0002504"
]
},
{
"tconst": "tt0000248",
"directors": [
"nm0808310"
],
"writers": []
}
]
Thanks in advance for any answers.