
I'm trying to extract strings from a .txt file with a few thousand sequences and write a CSV with these strings. I have deleted all of the irrelevant information from the original .txt file and this is the format of the document I have now:

DEFINITION  Homo sapiens haplogroup HV5 mitochondrion, complete genome.
ACCESSION   DQ377992
/haplogroup="HV5"
/pop_variant="Ashkenazi Jew"
/note="ethnicity:Ashkenazi Jew; origin_locality:Belarus:Homel' Volast', Vyetka; origin_coordinates:52.51 N 31.17 E"
DEFINITION  Homo sapiens haplotype U5b1c mitochondrion, complete genome.
ACCESSION   DQ661681
/haplotype="U5b1c"
/note="Native American (Cherokee)"

I am trying to extract the accession numbers, haplotype or haplogroup, ethnicity, location (origin_locality), coordinates (origin_coordinates) and any additional information that might have been put in /note= to a csv. One of the problems I am facing is that not every sequence has all of the information and not all of the strings are in their own quotation marks.

How do I extract the accession numbers and the strings between quotation marks, and make sure that each string ends up with the right sequence? Also, how would I deal with the strings that are only separated by semicolons?

edit: The other question does not address missing information or the resulting alignment in a CSV which was my primary concern.

  • [Click here](https://docs.python.org/2/howto/regex.html) to begin your journey into the magical world of pattern matching in Python. – Endzior Jun 07 '15 at 19:54
  • @Endzior I appreciate regex is probably the easiest way of doing this but sending me a link on it does not help me figure out how to effectively keep each sequence separate so one missing string does not mess up all of the results. – Modularized Jun 07 '15 at 19:58
  • You can use regex to find the things you are interested in, for instance: `^ACCESSION\s+([A-Z0-9]+)$` (with `re.MULTILINE`) will net you all accession numbers from the string – Endzior Jun 07 '15 at 20:00
  • what exactly do you want from your input provided? – Padraic Cunningham Jun 07 '15 at 20:03
  • @PadraicCunningham I want a csv with a column for accession numbers, haplotype, location etc and then all of the values for an accession number (which represents a single sequence) in one row. How do I put in empty values if a sequence does not, for example, have a known haplotype? – Modularized Jun 07 '15 at 20:06
  • @Modularized I would probably try to reformat the file into some usable format that you could use the csv module with. It is easy to extract the data, but the missing data would be a little harder – Padraic Cunningham Jun 07 '15 at 20:12
  • possible duplicate of [Regex Match all characters between two strings](http://stackoverflow.com/questions/6109882/regex-match-all-characters-between-two-strings) – Alexander McFarlane Jun 07 '15 at 20:30
  • @Modularized this is similar to another question, but adding matching criteria using the `regex` below should be a good starting point. If you are adding to an existing list of data, perhaps look at using the `pandas` library, which will just leave a column blank if you don't assign any data to it. You can then create a `csv` from your `DataFrame()` object – Alexander McFarlane Jun 07 '15 at 20:52

1 Answer


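You can create a class with all possible parameters as attributes. Then loop through the lines, creating a new object whenever a new record starts (i.e., when a line starts with 'DEFINITION') and filling in that object's attribute values from the lines that follow. Attributes that are never filled keep their default (e.g., an empty string), so missing information simply becomes an empty cell. After that, you can go through the objects and write each one's attribute values as a row in the CSV.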
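A minimal sketch of that approach, assuming the file looks like the sample in the question: every record starts with a `DEFINITION` line, and each `/key="value"` qualifier sits on a single line. The names `Record`, `parse_note`, `parse_records`, `write_csv`, and the column name `haplo` are made up for illustration. The `/note` value is split on semicolons; parts of the form `key:value` with a known key go into their own column, and anything else stays in the `note` column.

```python
import csv
import io
import re

# Column order for the CSV; "haplo" holds either haplogroup or haplotype.
FIELDS = ["accession", "haplo", "pop_variant", "ethnicity",
          "origin_locality", "origin_coordinates", "note"]
NOTE_KEYS = ("ethnicity", "origin_locality", "origin_coordinates")

# Matches lines such as /haplogroup="HV5" (assumes one qualifier per line).
QUALIFIER = re.compile(r'^/(\w+)="(.*)"$')


class Record:
    """One sequence; every attribute defaults to an empty string."""

    def __init__(self):
        for name in FIELDS:
            setattr(self, name, "")

    def as_row(self):
        return [getattr(self, name) for name in FIELDS]


def parse_note(record, text):
    """Split a /note value on ';' and route 'key:value' parts to known fields."""
    extras = []
    for part in text.split(";"):
        part = part.strip()
        key, sep, value = part.partition(":")  # split on the first colon only
        if sep and key in NOTE_KEYS:
            setattr(record, key, value.strip())
        elif part:
            extras.append(part)  # free text stays in the note column
    record.note = "; ".join(extras)


def parse_records(lines):
    """Build one Record per DEFINITION line; later lines fill that record."""
    records = []
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith("DEFINITION"):
            current = Record()
            records.append(current)
        elif current is None:
            continue  # ignore anything before the first record
        elif line.startswith("ACCESSION"):
            parts = line.split()
            if len(parts) > 1:
                current.accession = parts[1]
        else:
            match = QUALIFIER.match(line)
            if not match:
                continue
            key, value = match.groups()
            if key in ("haplogroup", "haplotype"):
                current.haplo = value
            elif key == "pop_variant":
                current.pop_variant = value
            elif key == "note":
                parse_note(current, value)
    return records


def write_csv(records, out):
    """Write a header plus one row per record to a file-like object."""
    writer = csv.writer(out)
    writer.writerow(FIELDS)
    for record in records:
        writer.writerow(record.as_row())


SAMPLE = """DEFINITION  Homo sapiens haplogroup HV5 mitochondrion, complete genome.
ACCESSION   DQ377992
/haplogroup="HV5"
/pop_variant="Ashkenazi Jew"
/note="ethnicity:Ashkenazi Jew; origin_locality:Belarus:Homel' Volast', Vyetka; origin_coordinates:52.51 N 31.17 E"
DEFINITION  Homo sapiens haplotype U5b1c mitochondrion, complete genome.
ACCESSION   DQ661681
/haplotype="U5b1c"
/note="Native American (Cherokee)"
"""

if __name__ == "__main__":
    buffer = io.StringIO()
    write_csv(parse_records(SAMPLE.splitlines()), buffer)
    print(buffer.getvalue())
```

For a real file, something like `with open("sequences.txt") as f: records = parse_records(f)` should work the same way (the file name is a placeholder). Because each record is its own object, a missing qualifier only leaves that record's cell empty and cannot shift values into the wrong row. A note that wraps over several lines would need extra handling that this sketch does not attempt.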

Vikas Ojha