Variants of this question have been asked and answered before, but I find that my sed/grep/awk skills are far too rudimentary to adapt those answers into a custom solution, since I hardly ever write shell scripts.

I have a rather large (100K+ lines) text file in which each line defines a GeoJSON object, each such object including a property called "county" (there are, all told, 100 different counties). Here's a snippet:

{"type": "Feature", "properties": {"county":"ALAMANCE", "vBLA": 0, "vWHI": 4, "vDEM": 0, "vREP": 2, "vUNA": 2, "vTOT": 4}, "geometry": {"type":"Polygon","coordinates":[[[-79.537429,35.843303],[-79.542428,35.843303],[-79.542428,35.848302],[-79.537429,35.848302],[-79.537429,35.843303]]]}},
{"type": "Feature", "properties": {"county":"NEW HANOVER", "vBLA": 0, "vWHI": 0, "vDEM": 0, "vREP": 0, "vUNA": 0, "vTOT": 0}, "geometry": {"type":"Polygon","coordinates":[[[-79.532429,35.843303],[-79.537428,35.843303],[-79.537428,35.848302],[-79.532429,35.848302],[-79.532429,35.843303]]]}},
{"type": "Feature", "properties": {"county":"ALAMANCE", "vBLA": 0, "vWHI": 0, "vDEM": 0, "vREP": 0, "vUNA": 0, "vTOT": 0}, "geometry": {"type":"Polygon","coordinates":[[[-79.527429,35.843303],[-79.532428,35.843303],[-79.532428,35.848302],[-79.527429,35.848302],[-79.527429,35.843303]]]}},

I need to split this into 100 separate files, each containing one county's GeoJSONs, and each named xxxx_bins_2016.json (where xxxx is the county's name). I'd also like the final character (comma) at the end of each such file to go away.

I'm doing this in Mac OSX, if that matters. I hope to learn a lot by studying any solutions you could suggest, so if you feel like taking the time to explain the 'why' as well as the 'what' that would be fantastic. Thanks!

EDITED to make clear that there are different county names, some of them two-word names.

  • In your example, all lines are named with county `ALAMANCE`? – Inian Feb 24 '17 at 17:56
  • Nope. There are 100 different county names. And some of them are two-word names, but they are all enclosed in quotes. – CheesyGrits Feb 24 '17 at 17:58
  • You should take a stab at this. SO is not a script-writing service. As asked, this is not really an appropriate question. – Greg Tarsa Feb 24 '17 at 17:59
  • @user1661497: Does the `"county"` part represent the county name? If they are all the same in the example, how do you expect to create unique files out of that? Update your sample input with your actual lines – Inian Feb 24 '17 at 18:00
  • I hear you. But my efforts to date have just produced garbage...possibly because I get really confused about regex, and what characters to escape, and when. So my 'stabs' so far have done nothing but draw blood, unfortunately. – CheesyGrits Feb 24 '17 at 18:02

5 Answers


jq can kind of do this; it can group the input and output one line of text per group. The shell then takes care of writing each line to an appropriately named file. jq itself doesn't really have the ability to open files for writing that would allow you to do this in a single process.

jq -Rn -c '[inputs[:-1]|fromjson] | group_by(.properties.county)[]' tmp.json |
  while IFS= read -r line; do
    county=$(jq -r '.[0].properties.county' <<< "$line")
    jq -r '.[]' <<< "$line" > "$county.txt"
  done

[inputs[:-1]|fromjson] reads each line of your file as a string, strips the trailing comma, then parses the line as JSON and wraps the lines into a single array. The resulting array is sorted and grouped by county name, then written to standard output, one group per line.
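
As a quick sanity check of that grouping stage (a hypothetical run against just the three sample lines from the question, using the tmp.json filename from above), mapping each group down to its county names shows one group per output line:

jq -Rn -c '[inputs[:-1]|fromjson] | group_by(.properties.county)[] | map(.properties.county)' tmp.json

["ALAMANCE","ALAMANCE"]
["NEW HANOVER"]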

The shell loop reads each line, extracts the county name from the first element of the group with a call to jq, then uses jq again to write each element of the group to the appropriate file, again one element per line.

(A quick look at https://github.com/stedolan/jq/issues doesn't appear to show any requests yet for an output function that would let you open and write to a file from inside a jq filter. I'm thinking of something like

jq -Rn '... | group_by(.properties.county) | output("\(.properties.county).txt")' tmp.json

without the need for the shell loop.)

chepner
  • I literally can't believe I came up with almost exactly the same logic. Do you think I should keep it, or is yours more robust? – Inian Feb 24 '17 at 18:02
  • Yours is simpler; mine is more efficient because I only open each output file once. (Assuming an even spread of data, there are about 1000 lines of output per county, so you have to open each output file 1000 times.) – chepner Feb 24 '17 at 18:05
  • Holy cow. I was completely unaware of jq. Just a quick glance at the git suggests it might well be the answer to my dreams, because I find sed/grep/awk nightmarish to use with jsons. Thank you soooooo much!!!! – CheesyGrits Feb 24 '17 at 18:08
  • @user1661497: Use this answer, as it is more robust than mine! – Inian Feb 24 '17 at 18:10
  • Yes, it does seem like an "output" function would be a useful addition to jq, since its main value seems to be for people like me who don't have strong shell scripting skills. After I've become more knowledgeable about jq I may post a request for such a function. – CheesyGrits Feb 24 '17 at 18:23

If using string parsing rather than proper JSON parsing to extract the county name is acceptable - brittle in general, but would work in this simple case - consider Sam Tolton's GNU awk answer, which has the potential to be by far the simplest and fastest solution.

To complement chepner's excellent answer with a variation that focuses on performance:

jq -Rrn '[inputs[:-1]|fromjson][] | .properties.county + "|" + (.|tostring)' file |
  awk -F'|' '{ print $2 > ($1 "_bins_2016.json") }'

Shell loops are avoided altogether, which should speed up the operation.

The general idea is:

  • Use jq to trim the trailing , from each input line, interpret the trimmed string as JSON, extract the county name, then output the trimmed JSON strings prepended with the county name and a distinct separator, |.

  • Use an awk command to split each line into the prepended county name and the trimmed JSON string, which allows awk to easily construct the output filename and write the JSON string to it (a sample of this intermediate stream is shown below).
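
For the three sample lines in the question, the intermediate stream handed to awk would look roughly like this (JSON bodies abbreviated here for readability):

ALAMANCE|{"type":"Feature","properties":{"county":"ALAMANCE",...},"geometry":{...}}
NEW HANOVER|{"type":"Feature","properties":{"county":"NEW HANOVER",...},"geometry":{...}}
ALAMANCE|{"type":"Feature","properties":{"county":"ALAMANCE",...},"geometry":{...}}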

Note: The awk command keeps all output files open until the script has finished, which means that, in your case, 100 output files will be open simultaneously - a number that shouldn't be a problem, however.

In cases where it is a problem, you can use the following variation, in which jq first sorts the lines by county name, which then allows awk to close the previous output file as soon as the next county is reached in the input:

jq -Rrn '
  [inputs[:-1]|fromjson] | sort_by(.properties.county)[] | 
    .properties.county + "|" + (.|tostring)
' file | 
   awk -F'|' '
    prevCounty != $1 { if (outFile) close(outFile); outFile = $1 "_bins_2016.json" }
    { print $2 > outFile; prevCounty = $1  }
  '
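
Splitting on every | is safe here because neither the county names nor this particular JSON contain that character. If that assumption ever broke, a more defensive sketch would split only on the first |, using POSIX awk's index() and substr() instead of -F:

awk '{
  sep = index($0, "|")                     # position of the first separator
  county = substr($0, 1, sep - 1)          # everything before it
  print substr($0, sep + 1) > (county "_bins_2016.json")
}'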
mklement0

A simpler version of chepner's answer:

while IFS= read -r line
do
    countyName=$(jq --raw-output '.properties.county' <<< "${line: : -1}")
    jq . <<< "${line: : -1}" >> "${countyName}_bins_2016.json"
done < file

The idea is to extract the county name with a jq filter after stripping the trailing , from each line of your input file. The comma-stripped line is then piped to jq's identity filter (.) to pretty-print it into the appropriate JSON file.

If you are on an older version of bash, use "${line%?}" instead of "${line: : -1}": negative-length substring expansion requires bash 4.2+, and the bash that ships with macOS is 3.2.
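
Both forms strip the final character; a quick interactive check (assuming a variable named line holding one input row):

line='{"a":1},'
echo "${line%?}"       # -> {"a":1}  (POSIX parameter expansion, works in any bash)
echo "${line: : -1}"   # -> {"a":1}  (needs bash 4.2 or later)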

For example, one of your county files ends up looking like this:

cat ALAMANCE_bins_2016.json
{
  "type": "Feature",
  "properties": {
    "county": "ALAMANCE",
    "vBLA": 0,
    "vWHI": 0,
    "vDEM": 0,
    "vREP": 0,
    "vUNA": 0,
    "vTOT": 0
  },
  "geometry": {
    "type": "Polygon",
    "coordinates": [
      [
        [
          -79.527429,
          35.843303
        ],
        [
          -79.532428,
          35.843303
        ],
        [
          -79.532428,
          35.848302
        ],
        [
          -79.527429,
          35.848302
        ],
        [
          -79.527429,
          35.843303
        ]
      ]
    ]
  }
}

Note: This solution could be performance-intensive, because reading the file line by line is expensive, and so is invoking jq twice for every one of the 100K+ lines.

Inian
  • Thanks for the alternative take on the same solution. The simplicity vs. efficiency difference is interesting and instructive. – CheesyGrits Feb 24 '17 at 18:10

This will do what you want, minus removing the trailing comma:

gawk 'match($0, /"county":"([^"]+)/, array){ print > (array[1] "_bins_2016.json") }' INPUT_FILE

This will output files in the current directory with filenames in the format COUNTY NAME_bins_2016.json.

The script goes line by line and uses a regex to match the exact term "county":" followed by one or more characters that aren't a ". It captures the characters within the quotes and then uses them as part of the filename to which the current line is appended.
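
Note that the three-argument form of match() is a gawk extension, which is why the command above needs gawk. On the BSD awk that ships with macOS, a roughly equivalent sketch uses POSIX match() with RSTART/RLENGTH instead (the offset 10 is the length of the literal "county":" prefix):

awk 'match($0, /"county":"[^"]+/) {
    name = substr($0, RSTART + 10, RLENGTH - 10)   # county name inside the quotes
    print > (name "_bins_2016.json")
}' INPUT_FILE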

To remove the trailing comma from all .json files in the current directory you could use:

sed -i '$ s/,$//' *.json

(That's GNU sed syntax; the BSD sed that ships with macOS requires an explicit backup-suffix argument, e.g. sed -i '' '$ s/,$//' *.json.)

If you were certain that the last character was always a comma, a faster solution would be to use truncate:

truncate -s-1 *.json

(Caveat: because print terminates each line with a newline, the file's final byte is that newline rather than the comma, so in practice you would need truncate -s-2 and then re-append a newline. truncate is also not part of stock macOS.)

Last part taken from this answer: https://stackoverflow.com/a/40568723/1453798

Sam Tolton
  • This is promising, and in this case the simplest and fastest solution (caveats re string parsing of JSON and keeping all output files open notwithstanding). I encourage you to add trimming of the trailing comma and to add an explanation. – mklement0 Feb 24 '17 at 20:10
  • I second mklement0's request for an explanation. This is the kind of awe-inspiring scripting that those of us who visit such issues only once in a blue moon can only marvel at...not really understand. – CheesyGrits Feb 24 '17 at 23:18
  • Hunh...after installing gawk, your script just immediately returns a command prompt, but no output files, no errors, nothing. Operator error? – CheesyGrits Feb 25 '17 at 02:59
  • You should have 2 new files in your current path called ALAMANCE_bins_2016.json and NEW HANOVER_bins_2016.json. Removing the trailing comma from each file is problematic as gawk writes line by line. You could write a second command to remove the trailing comma from all .json files in the current path though. – Sam Tolton Feb 27 '17 at 09:47

Here is a quickie script that will do the job. It has the virtue of working on most systems without having to install any other tools.

IFS=$'\n'
counties=( $( sed 's/^.*"county":"//;s/".*$//' counties.txt | sort -u ) )  # unique county names
unset IFS

for county in "${counties[@]}"
do
  filename="$county".out.txt
  echo "'$filename'"
  grep "\"$county\"" counties.txt > "$filename"
done

Setting IFS to $'\n' allows the array elements to contain spaces. The sed command strips off all the text up to the start of the county name and all the text after it, and sort -u collapses the 100K+ results down to the unique county names. The for loop iterates over the array elements. Finally, the grep search string is wrapped in escaped double quotes so that counties whose names are substrings of other counties don't accidentally get put into the wrong file.

See this section of the GNU BASH Reference Manual for more info.

Greg Tarsa
  • (Also, what's the purpose of iterating over indices rather than `for county in "${counties[@]}"`? The code you're using here will fail in cases where arrays are sparse: Consider if after populating counties, the 3rd one had been deleted with `unset counties[2]` -- this code would still try to iterate over the missing slot, *and* it would miss the last entry in the list). – Charles Duffy Feb 24 '17 at 18:43
  • @CharlesDuffy, Fair enough. I changed it to the GNU Reference manual. That is usually what I reference anyway. I stumbled on the TLDP one first. – Greg Tarsa Feb 24 '17 at 18:44
  • ...if you *do* want to iterate over indices, the better way to do that is `for i in "${!counties[@]}"` -- works properly with sparse arrays, associative arrays, &c. – Charles Duffy Feb 24 '17 at 18:45
  • @CharlesDuffy, Thanks for that tip. I updated my answer accordingly and learned something new. – Greg Tarsa Feb 24 '17 at 18:47
  • Consider `IFS=$'\n' read -r -d '' -a counties < <(sed 's/^.*"county":"//;s/".*$//' counties.txt && printf '\0')` -- that way we're scoping the `IFS` change without needing to change it back later, and a county of `*` won't be replaced with a list of filenames. (The `printf '\0'` makes sure that our `read` has an exit status of 0 if `sed` succeeds, thus not messing up `ERR` traps or `set -e`). – Charles Duffy Feb 24 '17 at 18:47
  • (If you don't want to switch from string-splitting to `read -a`, the other way to avoid globbing is `set -f`). – Charles Duffy Feb 24 '17 at 18:53
  • When I run this script (substituting my input file name for counties.txt in line 2), it outputs the string "{.out.txt" (without the quotes) numerous times to stdout, and creates a file of the same name which it keeps overwriting. That file is zero bytes. – CheesyGrits Feb 25 '17 at 03:18
  • Stupid mistake on my part. I was iterating the loop with a `for i...` and should have been iterating with `for county...` Fixed now. – Greg Tarsa Feb 25 '17 at 17:02