I know there have been several answers to questions about multiple delimiters, but my issue involves splitting on several delimiters but not all of them. I have a file that contains the following:

((((((Anopheles_coluzzii:0.002798,Anopheles_arabiensis:0.005701):0.001405,(Anopheles_gambiae:0.002824,Anopheles_quadriannulatus:0.004249):0.002085):0,Anopheles_melas:0.008552):0.003211,Anopheles_merus:0.011152):0.068265,Anopheles_christyi:0.086784):0.023746,Anopheles_epiroticus:0.082921):1.101881;

It is in Newick format, so all the information is on one long line. What I would like to do is isolate every number that directly follows another number. For example, the first number I would like to isolate is 0.001405. I would like to put it in a list with all the other numbers that follow a number (not a name etc).

I tried to use the following code:

import re

with open("file.nh", "r") as f:
    for line in f:
        # Grab every run of word characters (letters, digits, "_") or apostrophes.
        z = re.findall(r"[\w']+", line)

The issue here is that this splits on "." as well as on the other delimiters, because \w does not match a decimal point. This is a problem because all the numbers I require have decimal points.
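To illustrate the problem, here is a minimal sketch run on a shortened fragment of the tree above. The variable names `sample` and `tokens` are only for illustration.

```python
import re

# A shortened fragment of the tree, used as inline sample data.
sample = "(Anopheles_coluzzii:0.002798,Anopheles_arabiensis:0.005701):0.001405"

# \w matches letters, digits and "_", but not ".", so each number
# is cut in two at its decimal point.
tokens = re.findall(r"[\w']+", sample)
print(tokens)
```

Each branch length comes back as two separate tokens (for example "0" and "001405"), so the original values cannot be told apart from a genuine 0.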

I considered going along with this and converting the numbers in the list to ints and then removing all non-int values and 0 values. However, some of the files contain 0 as a value that needs to be kept.

So is there a way of choosing which delimiters to use and which to avoid when multiple delimiters are required?

spiral01
  • Google "python newick". It is always essential to ask yourself "has someone else done this already?", often in life in general but especially in programming. – Alex Hall May 17 '16 at 15:42
  • Hi if it is BioPython you are referring to I have indeed looked through the documentation but I cannot deduce how to obtain what I need from it, which is extracting the internal branch lengths of my trees. I am not suggesting it cannot be done in BioPython, as I'm sure there must be a way, but having had no success I decided to parse the file manually with python. – spiral01 May 17 '16 at 15:52
  • isolate all numbers follow another number... what to do in this case: `Anopheles_quadriannulatus:0.004249):0.002085):0`: Do you want 0.002085 and 0 or just the first or last one? – Günther Jena May 17 '16 at 15:54
  • I am referring to [this package](https://github.com/glottobank/python-newick) which is the first Google result I got. Also if you are having trouble with BioPython then ask a question about it. Parsing a tree structure ([such as HTML](http://stackoverflow.com/a/1732454/2482744)) with regexes is not a good idea. – Alex Hall May 17 '16 at 16:00

1 Answer

You don't need to split on some delimiters but not others if you write your regex to capture only the parts you want. By your definition, you want every number that follows "):". Using the re module, a possible solution is this:

import re

with open("file.nh", "r") as f:
    for line in f:
        # Capture only the number that immediately follows "):".
        z = re.findall(r"\):([0-9.]+)", line)
        print(z)

The result is:

['0.001405', '0.002085', '0', '0.003211', '0.068265', '0.023746', '1.101881']

The pattern r"\):([0-9.]+)" searches for "):" followed by one or more digits or decimal points. Only that second part is returned, which is why it is enclosed in parentheses (a capture group).
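The same pattern can be wrapped in a small function and applied to the tree from the question without reading a file. This is only a sketch: `internal_branch_lengths` is a hypothetical helper name, and the conversion to float is an addition that assumes every match is a plain decimal number.

```python
import re

# Capture the number that directly follows "):", i.e. the value
# written right after a closing parenthesis.
INTERNAL_LENGTH = re.compile(r"\):([0-9.]+)")


def internal_branch_lengths(newick):
    """Return the numbers following '):' as floats (hypothetical helper)."""
    return [float(match) for match in INTERNAL_LENGTH.findall(newick)]


tree = (
    "((((((Anopheles_coluzzii:0.002798,Anopheles_arabiensis:0.005701):0.001405,"
    "(Anopheles_gambiae:0.002824,Anopheles_quadriannulatus:0.004249):0.002085):0,"
    "Anopheles_melas:0.008552):0.003211,Anopheles_merus:0.011152):0.068265,"
    "Anopheles_christyi:0.086784):0.023746,Anopheles_epiroticus:0.082921):1.101881;"
)

lengths = internal_branch_lengths(tree)
print(lengths)
```

Note that the character class only covers digits and ".", so a length written in scientific notation (such as 1e-05) or an internal node that carries a label before the colon would not be matched by this pattern.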

As Alex Hall mentioned, in most cases it is not a good idea to use regex when the data is well structured. Look for libraries that work with the given data structure instead.
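As a rough illustration of what a structure-aware approach looks like, here is a minimal hand-written recursive-descent parser. It is not the API of python-newick or BioPython: `Node`, `parse_newick` and `internal_lengths` are hypothetical names, and the sketch assumes unquoted labels and no bracketed comments.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str = ""
    length: Optional[float] = None
    children: List["Node"] = field(default_factory=list)


def parse_newick(text: str) -> Node:
    """Parse a simple Newick string into a tree of Node objects."""
    text = "".join(text.split())  # drop all whitespace
    pos = 0

    def peek() -> str:
        # Current character, or "" once the end of the text is reached.
        return text[pos] if pos < len(text) else ""

    def parse_node() -> Node:
        nonlocal pos
        node = Node()
        if peek() == "(":
            pos += 1  # consume "("
            node.children.append(parse_node())
            while peek() == ",":
                pos += 1  # consume ","
                node.children.append(parse_node())
            if peek() != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ")"
        # Optional label, followed by an optional ":length".
        start = pos
        while peek() and peek() not in ",():;":
            pos += 1
        node.name = text[start:pos]
        if peek() == ":":
            pos += 1  # consume ":"
            start = pos
            while peek() and peek() not in ",();":
                pos += 1
            node.length = float(text[start:pos])
        return node

    root = parse_node()
    if peek() not in (";", ""):
        raise ValueError(f"unexpected character at position {pos}")
    return root


def internal_lengths(node: Node) -> List[float]:
    """Collect branch lengths of nodes that have children, in text order."""
    found: List[float] = []
    for child in node.children:
        found.extend(internal_lengths(child))
    if node.children and node.length is not None:
        found.append(node.length)
    return found


tree = (
    "((((((Anopheles_coluzzii:0.002798,Anopheles_arabiensis:0.005701):0.001405,"
    "(Anopheles_gambiae:0.002824,Anopheles_quadriannulatus:0.004249):0.002085):0,"
    "Anopheles_melas:0.008552):0.003211,Anopheles_merus:0.011152):0.068265,"
    "Anopheles_christyi:0.086784):0.023746,Anopheles_epiroticus:0.082921):1.101881;"
)

print(internal_lengths(parse_newick(tree)))
```

Because the tree is built explicitly, internal nodes are identified by having children rather than by the characters that happen to precede the number, so stray whitespace or labelled internal nodes do not change the result.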

Günther Jena
  • Thank you, this is exactly what I needed. Out of interest, why is it not a good idea to use regex if the data is well structured? – spiral01 May 17 '16 at 17:17
  • It's completely OK for a quick solution. It's not a good idea when one of the following applies: 1) You're processing data from different sources, so it's important to have a robust and flexible parser (whitespace, format and so on can differ). 2) You're processing huge amounts of data (but there are cases where regex is still the fastest option...) 3) You're transforming the data, which is hard to do on a string basis. – Günther Jena May 17 '16 at 22:22