How to convert chromosome names to the same format in pyranges before performing a join

Question

I have multiple .bed files and want to perform operations such as join and intersection on them. I am using the pyranges library to read the .bed files and perform these operations. Because the .bed format allows chromosome names with or without the "chr" prefix, I would like to convert the chromosome names in all the .bed files to the same format before performing the operations, so that they produce the expected output.

I tried:

>>> import pandas as pd
>>> import pyranges as pr
>>> df1 = pd.DataFrame({"Chromosome": ["chr1", "chr2"], "Start": [100, 200],
...                    "End": [150, 201]})
>>> py1 = pr.PyRanges(df1)
>>> df2 = pd.DataFrame({"Chromosome": ["1", "2"], "Start": [1000, 2000],
...                    "End": [1500, 20010]})
>>> py2 = pr.PyRanges(df2)
>>> def modify_chrom_series(df):
...    df.Chromosome = df.Chromosome.apply(lambda val: val.replace("chr", ""))
...    return df
>>> def fix_chrom(regions):
...    return regions.apply(modify_chrom_series)
>>> py1 = fix_chrom(py1)
>>> py1
+--------------+-----------+-----------+
|   Chromosome |     Start |       End |
|   (category) |   (int32) |   (int32) |
|--------------+-----------+-----------|
|            1 |       100 |       150 |
|            2 |       200 |       201 |
+--------------+-----------+-----------+
>>> py2 = fix_chrom(py2)
>>> py2

+--------------+-----------+-----------+
|   Chromosome |     Start |       End |
|   (category) |   (int32) |   (int32) |
|--------------+-----------+-----------|
|            1 |      1000 |      1500 |
|            2 |      2000 |     20010 |
+--------------+-----------+-----------+

>>> py1["1"]    
Empty PyRanges
>>> py1["chr1"]
+--------------+-----------+-----------+
|   Chromosome |     Start |       End |
|   (category) |   (int32) |   (int32) |
|--------------+-----------+-----------|
|            1 |       100 |       150 |
+--------------+-----------+-----------+

>>> py1.join(py2)
Empty PyRanges

With the above code, the chromosome names are reformatted, but the internal mapping of chromosome names in pyranges remains the same. As a result, operations such as join, or a query like py1["1"], do not work as expected.

Is there a way to get the desired behavior using pyranges?

This is a bug in pyranges. I have made an issue. Use `py1.Chromosome = py1.Chromosome.astype("str").str.replace("chr", "")` for now. It should work. — The Unfun Cat, Jul 31 '20 at 16:57
Created an issue here: https://github.com/biocore-ntnu/pyranges/issues/142 . Will fix after summer vacation :) — The Unfun Cat, Jul 31 '20 at 17:06

Marek Schwarz · Accepted Answer · 2020-08-01T06:17:17.263

The data in a PyRanges object are stored in multiple places. Apart from .Chromosome, there is .dfs, which is a dict. The keys of this dict are used when you make the py1["1"] call.

You also need to update the dict:

>>> df1 = pd.DataFrame({"Chromosome": ["chr1", "chr2"], "Start": [100, 200],
...                    "End": [150, 201]})
>>> py1 = pr.PyRanges(df1)
>>> py1.dfs["1"] = py1.dfs['chr1']
>>> del py1.dfs['chr1']
>>> py1["1"]

+--------------+-----------+-----------+
| Chromosome   |     Start |       End |
| (category)   |   (int32) |   (int32) |
|--------------+-----------+-----------|
| chr1         |       100 |       150 |
+--------------+-----------+-----------+
Unstranded PyRanges object has 1 rows and 3 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome.

Note that the chromosome name in the table did not change. As stated above, this is because the data are stored in multiple places.
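
The effect can be pictured with a toy model built from plain Python dicts. This only illustrates the idea and is not how pyranges is actually implemented; the `tables` name and the row layout are hypothetical.

```python
# Toy stand-in for a per-chromosome store (NOT pyranges internals):
# each dict key points to rows that carry their own chromosome label.
tables = {
    "chr1": [{"Chromosome": "chr1", "Start": 100, "End": 150}],
    "chr2": [{"Chromosome": "chr2", "Start": 200, "End": 201}],
}

# Rename only the lookup key, as in the workaround above.
tables["1"] = tables.pop("chr1")

# The key changed, but the label stored inside the rows did not.
key_present = "1" in tables
row_label = tables["1"][0]["Chromosome"]
print(key_present, row_label)  # True chr1
```

Keeping both places consistent means updating the key and the stored label together.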

To be honest, I don't understand PyRanges deeply, and I have no idea whether it is safe to update the data like this.

I strongly suggest pre-processing your data while they are still in .bed format. This will ensure that the data are imported into pyranges correctly.
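
A minimal sketch of such a pre-processing step, using only the Python standard library. It assumes tab-separated .bed files with the chromosome in the first column; the helper name `strip_chr_prefix` and the file names are made up for illustration.

```python
import tempfile
from pathlib import Path


def strip_chr_prefix(src_path, dst_path):
    """Copy a BED file, removing a leading "chr" from the first column."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            # Pass blank, comment, and header lines through unchanged.
            if not line.strip() or line.startswith(("#", "track", "browser")):
                dst.write(line)
                continue
            # Split off the first tab-separated field (the chromosome).
            chrom, sep, rest = line.partition("\t")
            if chrom.startswith("chr"):
                chrom = chrom[len("chr"):]
            dst.write(chrom + sep + rest)


# Demo on a throwaway file so nothing persists on disk.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "with_prefix.bed"
    dst = Path(tmp) / "no_prefix.bed"
    src.write_text("chr1\t100\t150\nchr2\t200\t201\n")
    strip_chr_prefix(src, dst)
    normalized = dst.read_text()

print(normalized)
```

The normalized file can then be loaded as usual, for example with `pr.read_bed`, so that every PyRanges object is built from consistently named chromosomes.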

Edit 1/8/20: The answer is based on pre-bugfix behavior and may not be needed in the future.

edited Aug 01 '20 at 06:17

answered Jul 20 '20 at 07:05

Marek Schwarz


That was a great workaround. What would you suggest for pre-processing the data? I could read the .bed files with pandas and modify the series, but there is no way to create a pyranges object from a pandas data frame, so I would have to save it to a file and read it again using pyranges. – srgothi92 Jul 26 '20 at 21:05
`there is no way to create pyranges object from pandas data frame` >> No, there is; in your example you did just that (created a pandas DataFrame and created a pyranges object from it). For the preprocessing, you can use `awk` on Linux; in pure Python, a simple `dict` lookup in a for loop, writing to a temporary file (e.g. via the `tempfile` module), would also work (if you don't want the changes to persist). – Marek Schwarz Jul 27 '20 at 06:46
Thanks for answering! Author here. All you need to do is `gr.Chromosome = "chr" + gr.Chromosome.astype("str")`. I downvoted, but will UV when you update. – The Unfun Cat Jul 31 '20 at 16:54
You should not have to know a pyranges contains dfs. The above is a bug :) – The Unfun Cat Jul 31 '20 at 17:01
