
Issue with the Unix split command for splitting large data: split -l 1000 file.json myfile. I want to split this file into multiple files of 1000 records each, but I'm getting the output as a single file, unchanged.

P.S. The file was created by converting a Pandas DataFrame to JSON.

Edit: It turns out that my JSON is formatted in a way that it contains only one line: wc -l file.json returns 0.

Here is a sample of file.json:

[
{"id":683156,"overall_rating":5.0,"hotel_id":220216,"hotel_name":"Beacon Hill Hotel","title":"\u201cgreat hotel, great location\u201d","text":"The rooms here are not palatial","author_id":"C0F"},
{"id":692745,"overall_rating":5.0,"hotel_id":113317,"hotel_name":"Casablanca Hotel Times Square","title":"\u201cabsolutely delightful\u201d","text":"I travelled from Spain...","author_id":"8C1"}
]
Shubham Jain
The Code Geek
  • Please clarify the requirements, e.g. by showing what each partition would look like. See [mcve] for further guidance. – peak Jun 27 '20 at 18:59

3 Answers


Invoking jq once per partition, plus once to determine the number of partitions, would be extremely inefficient. The following solution suffices to achieve the partitioning deemed acceptable in your answer:

jq -c ".[]" file.json | split -l 1000

If, however, it is deemed necessary for each file to be pretty-printed, you could run jq -s . for each file, which would still be more efficient than running .[N:N+S] multiple times.
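For instance, a sketch of that two-step approach (the chunk size of 2 and the chunk_ prefix are illustrative; use 1000 and whatever prefix you like in practice):

```shell
# Sample stand-in for file.json: a top-level JSON array.
printf '%s' '[{"id":1},{"id":2},{"id":3}]' > file.json

# Emit one compact object per line and split into fixed-size chunks;
# split names them chunk_aa, chunk_ab, ...
jq -c '.[]' file.json | split -l 2 - chunk_

# Optionally re-wrap each chunk as a pretty-printed JSON array.
for f in chunk_??; do
  jq -s . "$f" > "$f.json"
done
```

Each chunk_NN.json is then itself a valid JSON array of at most the chunk size.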

If each partition should itself be a single JSON array, then see Splitting / chunking JSON files with JQ in Bash or Fish shell?

peak

After asking elsewhere, the file was, in fact, a single line.

Reformatting with jq (in compact form) would enable the split, though to process the resulting files you would at least need to delete the first and last characters (or add '[' and ']' to each split file).
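A sketch of the second variant, re-adding '[' and ']' to each split file so every piece is valid JSON on its own (the part_ prefix and chunk size of 2 are illustrative):

```shell
# Sample stand-in for file.json: a one-line JSON array.
printf '%s' '[{"id":1},{"id":2},{"id":3}]' > file.json

# Compact to one object per line, then split into fixed-size pieces.
jq -c '.[]' file.json | split -l 2 - part_

# Wrap each piece in '[' and ']'; sed appends a comma to every
# line except the last so the result parses as a JSON array.
for f in part_??; do
  { echo '['; sed '$!s/$/,/' "$f"; echo ']'; } > "$f.json"
done
```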

Alister Bulman
  • I tried using jq but the problem isn't resolved. I ran this: cat merged.json | jq > newfile.json. It created a new line for every field: 'id' is on one line, 'overall_rating' is on the next. So the data is still not usable with the Unix split command. – The Code Geek Jun 27 '20 at 15:48

I'd recommend splitting the JSON array with jq (see the manual).

cat file.json | jq length              # get the length of the array
cat file.json | jq -c '.[0:1000]'      # first 1000 items (the slice end is exclusive)
cat file.json | jq -c '.[1000:2000]'   # next 1000 items
...

Notice -c for compact result (not pretty printed).

For automation, you can write a simple bash script that splits your file into chunks, given the array length (jq length).
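For example, a minimal loop over jq's slice expression might look like this (the slice_ file names and a chunk size of 2 are illustrative; use 1000 in practice):

```shell
# Sample stand-in for file.json.
printf '%s' '[{"id":1},{"id":2},{"id":3}]' > file.json

size=2                          # chunk size; use 1000 in practice
len=$(jq length file.json)      # number of items in the array
i=0; n=0
while [ "$i" -lt "$len" ]; do
  # jq slice ends are exclusive, and out-of-range ends are clamped.
  jq -c ".[$i:$((i + size))]" file.json > "slice_$n.json"
  i=$((i + size)); n=$((n + 1))
done
```

Note that this re-reads (and re-parses) the whole input once per chunk, which is why the streaming jq -c '.[]' | split approach scales better for large files.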

ΔO 'delta zero'