I'm trying to extract strings from a .txt file containing a few thousand sequences and write them to a CSV. I have deleted all of the irrelevant information from the original .txt file, and this is the format of the document I have now:
DEFINITION Homo sapiens haplogroup HV5 mitochondrion, complete genome.
ACCESSION DQ377992
/haplogroup="HV5"
/pop_variant="Ashkenazi Jew"
/note="ethnicity:Ashkenazi Jew; origin_locality:Belarus:Homel' Volast', Vyetka; origin_coordinates:52.51 N 31.17 E"
DEFINITION Homo sapiens haplotype U5b1c mitochondrion, complete genome.
ACCESSION DQ661681
/haplotype="U5b1c"
/note="Native American (Cherokee)"
I am trying to extract the accession number, the haplotype or haplogroup, the ethnicity, the location (origin_locality), the coordinates (origin_coordinates) and any additional information that might have been put in /note=, and write them to a CSV. One of the problems I am facing is that not every sequence has all of this information, and not all of the values are in their own quotation marks.
How do I extract the accession numbers and the strings between quotation marks, and make sure that each string ends up in the row of the sequence it belongs to? Also, how would I deal with the values that are only separated by semicolons?
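To make the goal concrete, here is a minimal sketch of the kind of result I am after. It is written in Python, which is my own choice here, and it rests on assumptions that may not hold for the whole file: every record starts with a DEFINITION line, each /name="value" pair sits on a single line, and only ethnicity, origin_locality and origin_coordinates appear as key:value pairs inside /note. The column names and the helper names (parse_note, parse_records, to_csv) are made up for illustration.

```python
import csv
import io
import re

# Column names are illustrative choices, not part of the source file.
FIELDS = ["accession", "haplogroup_or_haplotype", "pop_variant", "ethnicity",
          "origin_locality", "origin_coordinates", "note"]
NOTE_KEYS = ("ethnicity", "origin_locality", "origin_coordinates")
QUALIFIER = re.compile(r'^/(\w+)="(.*)"$')  # matches lines like /name="value"

SAMPLE = """DEFINITION Homo sapiens haplogroup HV5 mitochondrion, complete genome.
ACCESSION DQ377992
/haplogroup="HV5"
/pop_variant="Ashkenazi Jew"
/note="ethnicity:Ashkenazi Jew; origin_locality:Belarus:Homel' Volast', Vyetka; origin_coordinates:52.51 N 31.17 E"
DEFINITION Homo sapiens haplotype U5b1c mitochondrion, complete genome.
ACCESSION DQ661681
/haplotype="U5b1c"
/note="Native American (Cherokee)"
"""


def parse_note(text):
    """Split a /note value into known key:value pairs and leftover free text."""
    known, extra = {}, []
    for part in text.split(";"):
        part = part.strip()
        # Split on the first colon only, so "Belarus:Homel' ..." stays intact.
        key, sep, value = part.partition(":")
        if sep and key.strip() in NOTE_KEYS:
            known[key.strip()] = value.strip()
        elif part:
            extra.append(part)
    return known, extra


def parse_records(lines):
    """Build one dict per record; fields that are missing stay empty strings."""
    records, current = [], None
    for raw in lines:
        line = raw.strip()
        if line.startswith("DEFINITION"):  # assumed to open every record
            current = dict.fromkeys(FIELDS, "")
            records.append(current)
            continue
        if current is None:
            continue  # ignore anything before the first DEFINITION line
        if line.startswith("ACCESSION"):
            current["accession"] = line[len("ACCESSION"):].strip()
            continue
        match = QUALIFIER.match(line)
        if not match:
            continue
        name, value = match.groups()
        if name in ("haplogroup", "haplotype"):
            current["haplogroup_or_haplotype"] = value
        elif name == "pop_variant":
            current["pop_variant"] = value
        elif name == "note":
            known, extra = parse_note(value)
            current.update(known)
            leftover = [current["note"]] + extra
            current["note"] = "; ".join(p for p in leftover if p)
    return records


def to_csv(records):
    """Return the records as CSV text with a fixed column order."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


rows = parse_records(SAMPLE.splitlines())
csv_text = to_csv(rows)
```

Because every record starts out as a dict with all columns set to an empty string, a sequence with missing information still produces a full row, so the columns stay aligned. For the real file, the lines could come from an open file handle instead of SAMPLE, and the text returned by to_csv could be written to disk.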
Edit: The other question does not address missing information or the resulting alignment in a CSV, which was my primary concern.