0

I have two dfs and looking for an way to select (and count) rows of df1 based on rows in df2.

This is my df1:

      Chromosome  Start position  End position Reference Variant  reads  \
0       chr1       109419841     109419841         C       T      1
1       chr1       197008365     197008365         C       T      1

   variation reads  % variation                 gDNA nomencl  \
0                1          100  Chr1(GRCh37):g.109419841C>T
1                1          100  Chr1(GRCh37):g.197008365C>T

            cDNA nomencl    ...    exon transcript ID          inheritance  \
0  NM_013296.4:c.-258C>T    ...       2   NM_013296.4  Autosomal recessive
1  NM_001994.2:c.*143G>A    ...     UTR   NM_001994.2  Autosomal recessive

  test type                      Phenotype male coverage male ratio covered  \
0   Unknown  Deafness, autosomal recessief             0                  0
1   Unknown          Factor 13 deficientie             0                  0

  female coverage female ratio covered ratio M:F
0               1                    1       0.0
1               1                    1       0.0

df1 has these columns:

Chromosome                10561 non-null object
Start position            10561 non-null int64
End position              10561 non-null int64
Reference                 10415 non-null object
Variant                   10536 non-null object
reads                     10561 non-null int64
variation reads           10561 non-null int64
% variation               10561 non-null int64
gDNA nomencl              10561 non-null object
cDNA nomencl              10446 non-null object
protein nomencl           9997 non-null object
classification            10561 non-null object
status                    10561 non-null object
gene                      10560 non-null object
Sanger sequencing list    10561 non-null object
exon                      10502 non-null object
transcript ID             10460 non-null object
inheritance               8259 non-null object
test type                 10561 non-null object
Phenotype                 10380 non-null object
male coverage             10561 non-null int64
male ratio covered        10561 non-null int64
female coverage           10561 non-null int64
female ratio covered      10561 non-null int64

and this is df2:

 Chromosome  Startposition  Endposition    Bases  Meancoverage  \
0       chr1       11073785     11074022  27831.0    117.927966
1       chr1       11076901     11077064  11803.0     72.411043

   Mediancoverage  Ratiocovered>10X  Ratiocovered>20X Genename Componentnr  \
0            97.0               1.0               1.0   TARDBP           1
1            76.0               1.0               1.0   TARDBP           2

  PositionGenes          PositionGenome                       Position
0      TARDBP.1  chr1.11073785-11074022  comp.1_chr1.11073785-11074022
1      TARDBP.2  chr1.11076901-11077064  comp.2_chr1.11076901-11077064  

I want to select all rows from df1 that have in df2

  • the same value for 'Chromosome'
  • df1['Start position'] >= df2.Startposition
  • df1['End position'] <= df2.Endposition.

If these three criteria are met in the same row of df2, I want to select the corresponding row in df1.

I already fused the three columns 'Chromosome','Startposition' and 'Endposition' in 'PositionGenome' to generate a lambda function but coundn't come up with anything.

Thus, hope you can help me ...

SGeuer
  • 147
  • 1
  • 8
  • Please check this [answer](http://stackoverflow.com/a/34953669/2901002) – jezrael Feb 16 '17 at 09:31
  • @jezeral. If I try the answer you propose I get an memory error for pd.merge(df1,df2, on=['Chromosome']) . df1 has > 10.000 rows and df2 2 has > 6 million rows. I already reduced the dfs to the few column that are needed for the task but I still get the same error. – SGeuer Feb 16 '17 at 09:59
  • Exactly, there is problem in large dataframes... Unfortunately. – jezrael Feb 16 '17 at 10:00
  • I can concat them but that's not really the same, isn't it? – SGeuer Feb 16 '17 at 10:07
  • Hmmm, try `df1 = df1.set_index('Chromosome')`, then `df2 = df2.set_index('Chromosome')` and then `pd.concat([df1, df2])`. the best is test it in some small samples. – jezrael Feb 16 '17 at 10:15
  • that doesn't work because: I get NaN for the column that are not present in the other df thus that selecting for 'df[(df.StartA >= f.Startposition) & (df.StartA<=df.Endposition)]' leads to an empty df. and for 'pd.concat([df1,df2], axis=1) I get an shape error. – SGeuer Feb 16 '17 at 10:47

1 Answers1

0

A short updata: In the end I solved the problem with unix bedtools -wb. Still I would be glad if someone could come up with an python based solution.

SGeuer
  • 147
  • 1
  • 8
  • sorry, my last post was incomplete. this is the solution: bedtools intersect -a file1.bed -b file2.bed -wb – SGeuer Feb 22 '17 at 13:39