Split massive yaml file into N valid yaml files

Question

I have a big yaml file:

---
foo: bar
baz:
  bacon: true
  eggs: false
---
goo: car
star:
  cheese: true
  water: false
---
dog: boxer
food:
  turkey: true
  moo: cow
---
...

What I'd like to do is split this file into N valid YAML files.

I attempted this with csplit in bash, but I end up with either a lot more files than I want:

csplit --elide-empty-files -f rendered- example.yaml "/---/" "{*}"

or a split where the last file contains most of the content:

csplit --elide-empty-files -n 3 -f rendered- app.yaml "/---/" "{3}"

Neither is ideal. What I really want is to be able to say: split a YAML file into thirds, breaking at the `---` delimiter closest to each third. I know the result won't always be exact thirds.

Any ideas on how to accomplish this in bash?

I am not a YAML expert, so I'm not sure what "valid YAML" means here. For the above input, can you show the expected outputs? `csplit --elide-empty-files -f rendered- example.yaml "/---/" "{*}"` seems to produce valid files. – anishsane, Sep 23 '19 at 04:42
@anishsane It does, yes, but what I want is a file split into, say, 3 files, with the valid YAML documents distributed as evenly as possible across those 3 files, rather than splitting on `---` and having the third file contain all the remaining YAML. – mootpt, Sep 23 '19 at 17:46
You can `grep -c '^---$'`, divide that by 3 and then use that number for `{repetition}`. E.g., if the file contains 50 entries, use `csplit --elide-empty-files -n 3 -f rendered- app.yaml "/---/" "{16}"` – anishsane, Sep 24 '19 at 03:19

score 2 · Answer 1 · answered Oct 08 '20 at 21:04


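I don't think there's a way to do this with csplit. With awk, I was able to split the file into files of up to 1000 YAML documents each (the anchored pattern /^---$/ matches only lines that consist of the separator):

awk '/^---$/{f="rendered-"int(++i/1000);}{print > f;}' app.yaml

To get exactly three files, you could try something like the following. It deals the documents out round-robin, so each file gets roughly a third of them, though not as one contiguous block:

awk '/^---$/{f="rendered-"(++i%3);}{print > f;}' app.yaml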
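If the three files should each hold a contiguous run of documents instead, a two-pass variant is one option. The following is only a sketch under a few assumptions: `split_yaml` is a hypothetical helper name, documents are separated by lines that are exactly `---`, and the `rendered-` default prefix is borrowed from the commands above. It first counts the separators with grep, then has awk send document number i (counting from 0) to chunk int(i * N / total).

```shell
#!/bin/bash
# split_yaml FILE N [PREFIX]  (hypothetical helper, not part of any tool)
# Writes PREFIX0 .. PREFIX(N-1), each holding a contiguous run of documents.
split_yaml() {
    local file=$1 parts=$2 prefix=${3:-rendered-}
    local total

    # Pass 1: count the documents, i.e. lines that are exactly '---'.
    total=$(grep -c -e '^---$' "$file")
    if [ "$total" -eq 0 ]; then
        echo "no '---' separators found in $file" >&2
        return 1
    fi

    # Pass 2: at each separator, pick the chunk for the document that starts
    # there. Lines before the first separator fall into chunk 0.
    awk -v total="$total" -v parts="$parts" -v prefix="$prefix" '
        BEGIN   { f = prefix "0" }
        /^---$/ { f = prefix int(i++ * parts / total) }
                { print > f }
    ' "$file"
}
```

For example, `split_yaml app.yaml 3` would write rendered-0, rendered-1 and rendered-2. When the document count is not a multiple of N, the chunk sizes differ by at most one document, and if there are fewer documents than N, fewer than N files are produced.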


Neil


If the first line of the YAML isn't `---`, you'll need to add this to the awk: `BEGIN { f="rendered-0" }`. – Chris Jones Sep 12 '22 at 18:37

score 0 · Answer 2 · answered Sep 22 '19 at 22:58


My idea is not a one-liner, but this works.

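#!/bin/bash
# Split example.yaml into one file per YAML document, named by line range.
file=example.yaml
output=output_
# Line number one past the end of the file; it closes the last document.
count=$(( $(wc -l < "$file") + 1 ))
# Line numbers of the '---' separator lines, followed by that end marker.
lines=$(grep -n -e '^---$' "$file" | cut -d: -f1)
lines="${lines} ${count}"
# The first separator starts the first document; the rest are loop bounds.
start=$(echo ${lines} | awk '{ print $1 }')
lines=$(echo ${lines} | sed 's/^[0-9]*//')

for n in ${lines}
do
    end=$((n - 1))
    sed -n "${start},${end}p" "$file" > "${output}${start}-${end}.yaml"
    start=$n
done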


Yuji


This seems similar to csplit? – mootpt Sep 23 '19 at 00:52
Please tell me the results you want. – Yuji Sep 23 '19 at 02:50
