How to split a file according to labels using pandas?

Question

I have a genome sequencing file in the following format:

chromosome name (string) | location (int) | readings (int)

Data for all chromosomes are stored in a single file, and I wish to:

1. split the file into individual chromosome data files;
2. convert chromosome names, e.g. 'chr1', 'x', to integers.

How can I do that with Pandas?

import pandas as pd
df = pd.read_csv('sample.txt', delimiter='\t', header=None)

The data look like this

0   chr1    3000573     0   
1   chr1    3000574     3   
2   chr2    3000725     1   
3   chr2    3000726     4   
4   chr3    3000900     1   
5   chr3    3000901     0

I can also reindex the data frame by the chromosome labels chr1, chr2, ...

You should probably make your question more specific, and concentrate on a particular problem with your code. Right now it's coming across as "here's a vaguely-described collection of tasks; please implement them for me", which I'm sure wasn't your intent. — DSM, Aug 28 '15 at 18:16
Yes, I'm trying to group the chromosomes into their own files, or find a way to pull the data of a single chromosome with a pandas command. I know how to do this for columns of a data frame, e.g. df['location']; is there anything like that for rows? — Machine, Aug 28 '15 at 18:22
do a sort and groupby, you could also just use the csv lib and a dict to group — Padraic Cunningham, Aug 28 '15 at 18:25
For others to be helpful, your question needs to be more specific. For example, include some sample data and illustrate the desired output. Please refer to this post for how to write good questions. http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples — Alexander, Aug 28 '15 at 19:49
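
The csv-plus-dict idea from the comment above could look roughly like the sketch below. It assumes three tab-separated columns with no header row, and it uses an inline sample in place of sample.txt so that it runs on its own. The name rows_by_chrom is only illustrative.

```python
import csv
import io
from collections import defaultdict

# Inline stand-in for sample.txt (assumed: tab-separated, no header row).
sample = io.StringIO(
    "chr1\t3000573\t0\n"
    "chr1\t3000574\t3\n"
    "chr2\t3000725\t1\n"
    "chr2\t3000726\t4\n"
    "chr3\t3000900\t1\n"
    "chr3\t3000901\t0\n"
)

# Map each chromosome label to the list of its (location, readings) rows.
rows_by_chrom = defaultdict(list)
for chrom, location, readings in csv.reader(sample, delimiter="\t"):
    rows_by_chrom[chrom].append((int(location), int(readings)))

for chrom, rows in rows_by_chrom.items():
    print(chrom, len(rows))
```

With a real file, io.StringIO(...) would be replaced by open("sample.txt", newline=""), and each list could then be written out with csv.writer.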

score 2 · Accepted Answer · answered Aug 28 '15 at 23:26

Writing each chromosome's data to an individual file is easy once the dataframe is split into pieces. I'm not quite sure what you mean by "convert chromosome names to integers", but if you mean that given "chrx" you want x as an int, that's easy enough. Assuming you have chromosomes "chr1" through "chrn", where n is an integer:

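import pandas

# The file has three tab-separated columns and no header row.
df = pandas.read_csv("sample.txt", delimiter="\t", header=None,
                     names=["chrid", "location", "readings"])
n = 3  # highest chromosome number in the file (example value; adjust to your data)
chrs = []
for chrid in range(1, n + 1):  # range() excludes its upper bound
    # .copy() so the assignment below does not act on a view of df
    chrom = df.loc[df["chrid"] == "chr" + str(chrid)].copy()
    # Drop the "chr" prefix and keep the remaining digits as an int
    chrom["chrid"] = chrom["chrid"].str[3:].astype(int)
    chrs.append(chrom)
# chrs is now a list of dataframes, one per chromosome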
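
If the number of chromosomes is not known up front, or some names are not numeric (the question mentions 'x'), groupby may be a simpler route. The following is only a sketch under assumptions: the SPECIAL mapping of 'x' and 'y' to 23 and 24 is an arbitrary choice made for illustration, the helper names chrom_to_int and write_pieces are made up here, and an inline sample stands in for sample.txt.

```python
import io
import os

import pandas as pd

# Inline stand-in for sample.txt (assumed: tab-separated, no header row).
sample = io.StringIO(
    "chr1\t3000573\t0\n"
    "chr1\t3000574\t3\n"
    "chr2\t3000725\t1\n"
    "chr2\t3000726\t4\n"
    "x\t3000900\t1\n"
)
df = pd.read_csv(sample, sep="\t", header=None,
                 names=["chrid", "location", "readings"])

# Assumption: integers chosen for non-numeric names; pick what suits your genome.
SPECIAL = {"x": 23, "y": 24}

def chrom_to_int(name):
    """Turn 'chr7' into 7 and 'x' or 'chrX' into its SPECIAL number."""
    label = name.lower()
    if label.startswith("chr"):
        label = label[3:]
    return SPECIAL[label] if label in SPECIAL else int(label)

df["chrnum"] = df["chrid"].map(chrom_to_int)

# groupby yields one (key, sub-frame) pair per distinct chromosome number.
pieces = {num: part for num, part in df.groupby("chrnum")}

def write_pieces(pieces, out_dir="."):
    """Write each sub-frame to its own tab-separated file in out_dir."""
    for num, part in pieces.items():
        path = os.path.join(out_dir, f"chromosome_{num}.txt")
        part.to_csv(path, sep="\t", header=False, index=False)
```

Calling write_pieces(pieces) would then create one file per chromosome in the current directory, and pieces[1] gives the rows for a single chromosome without any file I/O.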
