I have a pandas dataframe and I need to calculate the sum of a column of values that fall within a certain window. So for instance, if I have a window of 500, and my initial value is 1000, I want to sum all values that are between 499 and 999, and also between 1001 and 1501.
This is easier to explain with some data:
chrom pos end AFR EUR pi
0 1 10177 10177 0.4909 0.4056 0.495988
1 1 10352 10352 0.4788 0.4264 0.496369
2 1 10617 10617 0.9894 0.9940 0.017083
3 1 11008 11008 0.1346 0.0885 0.203142
4 1 11012 11012 0.1346 0.0885 0.203142
5 1 13110 13110 0.0053 0.0567 0.053532
6 1 13116 13116 0.0295 0.1869 0.176091
7 1 13118 13118 0.0295 0.1869 0.176091
8 1 13273 13273 0.0204 0.1471 0.139066
9 1 13550 13550 0.0008 0.0080 0.007795
10 1 14464 14464 0.0144 0.1859 0.161422
11 1 14599 14599 0.1210 0.1610 0.238427
12 1 14604 14604 0.1210 0.1610 0.238427
13 1 14930 14930 0.4811 0.5209 0.500209
14 1 14933 14933 0.0015 0.0507 0.044505
15 1 15211 15211 0.5371 0.7316 0.470848
16 1 15585 15585 0.0008 0.0020 0.002635
17 1 15644 15644 0.0008 0.0080 0.007795
18 1 15777 15777 0.0159 0.0149 0.030470
19 1 15820 15820 0.4849 0.2714 0.477153
20 1 15903 15903 0.0431 0.4652 0.349452
21 1 16071 16071 0.0091 0.0010 0.011142
22 1 16142 16142 0.0053 0.0020 0.007721
23 1 16949 16949 0.0227 0.0159 0.038759
24 1 18643 18643 0.0023 0.0080 0.009485
25 1 18849 18849 0.8411 0.9911 0.170532
26 2 30923 30923 0.6687 0.9364 0.338400
27 2 20286 46286 0.0053 0.0010 0.006863
28 2 21698 46698 0.0015 0.0010 0.002566
29 2 42159 47159 0.0083 0.0696 0.067187
So I need to subset based on the first two columns. For example, if my window = 500, my chrom = 1 and my pos = 15500, I will need to subset my df to include only those rows that have chrom = 1 and 15000 > pos < 16000.
I would then like to sum the AFR column of this subset of data.
Here is the function I have made:
#vdf is my main dataframe,
#polyChrom is the chromosome to subset by,
#polyPos is the position to subset by.
#Distance is how far the window should be from the polyPos.
#windowSize is the size of the window itself
#E.g. if distance=20000 and windowSize= 500, we are looking at a window
#that is (polyPos-20000)-500 to (polyPos-20000) and a window that is
#(polyPos+20000) to (polyPos+20000)+500.
def mafWindow(vdf, polyChrom, polyPos, distance, windowSize):
#If start position becomes less than 0, set it to 0
if(polyPos - distance < 0):
start1 = 0
end1 = windowSize
else:
start1 = polyPos - distance
end1 = start1 + windowSize
end2 = polyPos + distance
start2 = end2 - windowSize
#subset df
df = vdf.loc[(vdf['chrom'] == polyChrom) & ((vdf['pos'] <= end1) & (vdf['pos'] >= start1))|
((vdf['pos'] <= end2) & (vdf['pos'] >= start2))].copy()
return(df.AFR.sum())
This whole method works on subsetting the dataframe and is very slow when my dataframe contains ~55k rows. Is there a quicker and more efficient way of doing this?