I have two dfs and looking for an way to select (and count) rows of df1 based on rows in df2.
This is my df1:
Chromosome Start position End position Reference Variant reads \
0 chr1 109419841 109419841 C T 1
1 chr1 197008365 197008365 C T 1
variation reads % variation gDNA nomencl \
0 1 100 Chr1(GRCh37):g.109419841C>T
1 1 100 Chr1(GRCh37):g.197008365C>T
cDNA nomencl ... exon transcript ID inheritance \
0 NM_013296.4:c.-258C>T ... 2 NM_013296.4 Autosomal recessive
1 NM_001994.2:c.*143G>A ... UTR NM_001994.2 Autosomal recessive
test type Phenotype male coverage male ratio covered \
0 Unknown Deafness, autosomal recessief 0 0
1 Unknown Factor 13 deficientie 0 0
female coverage female ratio covered ratio M:F
0 1 1 0.0
1 1 1 0.0
df1 has these columns:
Chromosome 10561 non-null object
Start position 10561 non-null int64
End position 10561 non-null int64
Reference 10415 non-null object
Variant 10536 non-null object
reads 10561 non-null int64
variation reads 10561 non-null int64
% variation 10561 non-null int64
gDNA nomencl 10561 non-null object
cDNA nomencl 10446 non-null object
protein nomencl 9997 non-null object
classification 10561 non-null object
status 10561 non-null object
gene 10560 non-null object
Sanger sequencing list 10561 non-null object
exon 10502 non-null object
transcript ID 10460 non-null object
inheritance 8259 non-null object
test type 10561 non-null object
Phenotype 10380 non-null object
male coverage 10561 non-null int64
male ratio covered 10561 non-null int64
female coverage 10561 non-null int64
female ratio covered 10561 non-null int64
and this is df2:
Chromosome Startposition Endposition Bases Meancoverage \
0 chr1 11073785 11074022 27831.0 117.927966
1 chr1 11076901 11077064 11803.0 72.411043
Mediancoverage Ratiocovered>10X Ratiocovered>20X Genename Componentnr \
0 97.0 1.0 1.0 TARDBP 1
1 76.0 1.0 1.0 TARDBP 2
PositionGenes PositionGenome Position
0 TARDBP.1 chr1.11073785-11074022 comp.1_chr1.11073785-11074022
1 TARDBP.2 chr1.11076901-11077064 comp.2_chr1.11076901-11077064
I want to select all rows from df1 that have in df2
- the same value for 'Chromosome'
- df1['Start position'] >= df2.Startposition
- df1['End position'] <= df2.Endposition.
If these three criteria are met in the same row of df2, I want to select the corresponding row in df1.
I already fused the three columns 'Chromosome','Startposition' and 'Endposition' in 'PositionGenome' to generate a lambda function but coundn't come up with anything.
Thus, hope you can help me ...