I have a YAML file with more than 20K lines, and it seems to contain more than 2K duplicated keys. I'm looking for an automatic way to remove all the duplicated keys from the file. PhpStorm can remove them one by one, but the file is too big for that. Any online tool or custom code in PHP, Go, or Python that removes duplicate keys would be welcome.
2 Answers
Since YAML disallows duplicate keys, strictly speaking, your file is not a YAML file. However, you can use the low-level event API to walk over your file. This API is, in all implementations I know, too low-level to already check for duplicate keys (while most APIs that give you a YAML DOM would have removed duplicates one way or another).
Here's how to do it in Python:
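import yaml, sys
from yaml.events import *

input = """
spam: egg
sausage: spam
baked beans: spam
spam: [spam, spam, spam]
baked beans: egg
tomato: [spam, spam]
"""

events = []      # events that survive and will be re-emitted
level = 0        # collection nesting depth; 1 means the top-level mapping
seen = set()     # top-level keys encountered so far
is_key = False   # whether the current level-1 node is in key position
skip = False     # True while dropping a duplicate key and its value

for event in yaml.parse(input):
    # At the top level, nodes alternate between key and value.
    if level == 1: is_key = not is_key
    if isinstance(event, CollectionStartEvent):
        level += 1
    elif isinstance(event, CollectionEndEvent):
        level -= 1
        # A skipped collection value ends here; drop its end event too.
        if level == 1 and skip:
            skip = False
            continue
    elif isinstance(event, ScalarEvent):
        if level == 1:
            if is_key:
                if event.value in seen:
                    print("skipping duplicate key: " + event.value)
                    skip = True
                else: seen.add(event.value)
            else:
                # A skipped scalar value; drop it and resume normal output.
                if skip:
                    skip = False
                    continue
    if not skip: events.append(event)

print("Result:\n---")
yaml.emit(events, sys.stdout)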
Output:
skipping duplicate key: spam
skipping duplicate key: baked beans
Result:
---
spam: egg
sausage: spam
baked beans: spam
tomato: [spam, spam]
go-yaml does not give you access to the event stream, so this approach is not possible there. Mind that loading and dumping YAML cannot perfectly preserve style (e.g. comments); this question discusses that issue in detail.
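The loop above only tracks keys of the top-level mapping. If the duplicates sit deeper in the file, the same idea can be extended with one `seen` set per open mapping. The following is a minimal sketch of that extension, not a tested tool: the function name `drop_duplicate_keys` and the frame layout are my own choices, only scalar keys are compared, and it will fail at emit time if a dropped value defined an anchor that is referenced later.

```python
import yaml
from yaml.events import (
    AliasEvent, CollectionEndEvent, CollectionStartEvent,
    MappingStartEvent, ScalarEvent,
)


def drop_duplicate_keys(text):
    """Re-emit `text`, keeping only the first occurrence of each scalar key
    in every mapping, at any nesting depth (hypothetical helper)."""
    kept = []
    stack = []          # one entry per open collection: dict for mappings, None for sequences
    skip_depth = 0      # > 0 while inside a collection value that is being dropped
    drop_value = False  # True right after a duplicate key was dropped

    def node_done():
        # A complete node just ended; flip key/value position in the parent mapping.
        if stack and stack[-1] is not None:
            stack[-1]["key_next"] = not stack[-1]["key_next"]

    for event in yaml.parse(text):
        if skip_depth:
            # Swallow everything until the dropped collection is closed.
            if isinstance(event, CollectionStartEvent):
                skip_depth += 1
            elif isinstance(event, CollectionEndEvent):
                skip_depth -= 1
                if skip_depth == 0:
                    node_done()
            continue

        if isinstance(event, CollectionStartEvent):
            if drop_value:
                drop_value = False
                skip_depth = 1
                continue
            is_mapping = isinstance(event, MappingStartEvent)
            stack.append({"seen": set(), "key_next": True} if is_mapping else None)
        elif isinstance(event, CollectionEndEvent):
            stack.pop()
            node_done()
        elif isinstance(event, (ScalarEvent, AliasEvent)):
            if drop_value:
                # Scalar (or alias) value of a duplicate key: drop it.
                drop_value = False
                node_done()
                continue
            frame = stack[-1] if stack else None
            if frame is not None and frame["key_next"] and isinstance(event, ScalarEvent):
                if event.value in frame["seen"]:
                    drop_value = True
                    node_done()
                    continue
                frame["seen"].add(event.value)
            node_done()
        kept.append(event)

    # With no stream argument, yaml.emit returns the document as a string.
    return yaml.emit(kept)
```

You would call it on the whole file contents, e.g. `drop_duplicate_keys(open("myfile.yaml").read())`, and write the returned string to a new file. As with the code above, comments are not preserved.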

I noticed that yaml_parse_file turns mapping keys into array indexes, so a repeated key automatically overwrites the earlier entry.
I used this to convert the YAML file into an array and then the array back into a YAML file.
It all worked well.
$res = yaml_parse_file("./myfile.yaml");
yaml_emit_file("./myfile_with_no_duplicate.yaml", $res);
Here is a sample showing that it works:
<?php
$yaml = <<<EOD
testKey:
  redirects:
    paths:
      /about:
        to: somewhere
        code: 301
      /about:
        to: somewhere
        code: 301
EOD;
$parsed = yaml_parse($yaml);
var_dump($parsed);
$with_no_duplicates = yaml_emit($parsed);
var_dump($with_no_duplicates);
?>
Result:
array(1) {
  ["testKey"]=>
  array(1) {
    ["redirects"]=>
    array(1) {
      ["paths"]=>
      array(1) {
        ["/about"]=>
        array(2) {
          ["to"]=>
          string(9) "somewhere"
          ["code"]=>
          int(301)
        }
      }
    }
  }
}
string(95) "---
testKey:
  redirects:
    paths:
      /about:
        to: somewhere
        code: 301
...
"

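Note that with this load-and-dump approach the last occurrence of a key wins, because each repeated key overwrites the previous entry; in the sample the two `/about` entries are identical, so you cannot tell the difference. The same trick works in Python, since PyYAML's `safe_load` also accepts repeated keys without raising an error. A small sketch, assuming PyYAML 5.1 or newer for the `sort_keys` argument (the sample data is made up for illustration):

```python
import yaml

doc = """
spam: egg
sausage: spam
spam: [spam, spam, spam]
"""

# safe_load builds a plain dict, so the second "spam" entry
# replaces the first one instead of raising an error.
data = yaml.safe_load(doc)
print(data["spam"])

# Dumping the dict writes each key exactly once; comments and
# the original formatting are not preserved.
deduplicated = yaml.safe_dump(data, sort_keys=False)
print(deduplicated)
```

If you need the first occurrence to survive instead, loading and dumping is not enough and you have to filter the keys yourself.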