conditional math operations between rows

Question

Reposting this after receiving a downvote. I went back and tried something, but I guess I am still not there yet.

File with data which looks like this:

name    count   count1  count3  add1    add2
jack    70  55  31  100174766   100170715
jack    45  656 48  100174766   100174052
john    41  22  89  102268764   102267805
john    47  31  63  102268764   102267908
david   10  56  78  103361093   103368592

There are two conditions I need to check, and one math operation that has to be done afterwards: A) which rows have a duplicated value in add1 (a duplicated value always occurs exactly twice); B) for each such pair, which row has the greater value in add2.

Let's take jack as an example:

jack    70  55  31  100174766   100170715
jack    45  656 48  100174766   100174052

jack's add1 value occurs twice, and 100174052 is the greater add2, so:

row1 = jack 45  656 48  100174766   100174052
row2 = jack 70  55  31  100174766   100170715

Math:

For each cell, across the two rows: row1 / (row1 + row2)

output for jack :

jack    0.391304348 0.922644163 0.607594937 100174766   100174052
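
As a sanity check, the formula can be reproduced for jack's three count columns with a few lines of Python 3 (assumed here so that `/` is true division; the variable names are illustrative only):

```python
# Worked example of row1 / (row1 + row2) for jack's count columns.
# Python 3 is assumed: in Python 2, 45 / (45 + 70) would be integer division.
row1 = [45, 656, 48]   # row with the greater add2 (100174052)
row2 = [70, 55, 31]    # row with the smaller add2 (100170715)

ratios = [a / (a + b) for a, b in zip(row1, row2)]
print([round(r, 9) for r in ratios])  # [0.391304348, 0.922644163, 0.607594937]
```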

final desired output

name    count   count1  count3  add1    add2
jack    0.391304348 0.922644163 0.607594937 100174766   100174052
john    0.534090909 0.58490566  0.414473684 102268764   102267908

code so far:

I know I have not accounted for which add2 is greater; I am not sure where or how to do that.

info = []
with open('file.tsv', 'r') as j:
    for i,line in enumerate(j):
        lines = line.strip().split('\t')
        info.append(lines)

uniq = {}
for index,row in enumerate(info, start =1):
    if row.count(row[4]) == 2:
       key = row[4] + ':' + row[5]
    if key not in uniq:
        uniq[key] = row[1:3]

for k, v in sorted(uniq.iteritems()):
    row1 = k,v
    row2 = k,v
    print 'row1: ', row1[0], '\n', 'row2: ',row2[0]

All I see is:

row1:  100174766:100170715 
row2:  100174766:100170715
row1:  100174766:100174052 
row2:  100174766:100174052

instead of

row1:  100174766:100170715
row2:  100174766:100174052

Is there any other non-"jack" row with the same value in `add1`? — Patrick Artner, Jun 16 '18 at 09:23
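
Note that `row.count(row[4])` in the question's loop counts how often the value occurs inside that one row, so it cannot detect an add1 value shared by two different rows. A minimal sketch of counting add1 across all rows instead, assuming Python 3 and using `collections.Counter` (a choice made for this sketch, not something from the question; `rows`, `add1_counts` and `paired` are illustrative names):

```python
from collections import Counter

# Rows as they would look after line.split(); values copied from the sample data.
rows = [
    ["jack", "70", "55", "31", "100174766", "100170715"],
    ["jack", "45", "656", "48", "100174766", "100174052"],
    ["john", "41", "22", "89", "102268764", "102267805"],
    ["john", "47", "31", "63", "102268764", "102267908"],
    ["david", "10", "56", "78", "103361093", "103368592"],
]

# Condition A: count each add1 value over the whole file, not within a row.
add1_counts = Counter(row[4] for row in rows)

# Keep only the rows whose add1 value occurs exactly twice.
paired = [row for row in rows if add1_counts[row[4]] == 2]
for row in paired:
    print(row[0], row[4], row[5])
```

Here david is dropped because his add1 value occurs only once.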

score 1 · Accepted Answer · answered Jun 16 '18 at 01:49


# assumes pandas is imported as pd and dat = pd.read_csv('file.tsv', sep='\t')
dat.sort_values('add2', ascending=False).groupby(['name', 'add1']).aggregate(lambda x: x.iloc[0] / sum(x))

                    count    count1    count3      add2
name  add1                                             
david 103361093  1.000000  1.000000  1.000000  1.000000
jack  100174766  0.391304  0.922644  0.607595  0.500008
john  102268764  0.534091  0.584906  0.414474  0.500000

answered Jun 16 '18 at 01:49

Onyambu


hi Onyambu, appreciate your answer, can this be done using pure python? just curious and learning the basics – novicebioinforesearcher Jun 16 '18 at 05:05
@novicebioinforesearcher What do you mean by pure python? Or did you mean R? – Onyambu Jun 17 '18 at 04:20
I meant writing Python code without using pandas; your answer works too. – novicebioinforesearcher Jun 17 '18 at 04:39

Patrick Artner · Answer 2 · 2018-06-16T11:27:04.043

Anything pandas can do can be done with pure Python; it just needs more code.

To make this a minimal, complete, verifiable example that runs, for example, inside https://pyfiddle.io, you first need to create the file:

# create file
with open("d.txt","w") as f:
    f.write("""name    count   count1  count3  add1    add2
jack    70  55  31  100174766   100170715
jack    45  656 48  100174766   100174052
john    41  22  89  102268764   102267805
john    47  31  63  102268764   102267908
david   10  56  78  103361093   103368592""")

With that out of the way, I define some helpers:

def printMe(gg):
    """Pretty prints a dictionary"""
    print ""
    for k in gg:
        print k, "\t:  ", gg[k]

def spaceEm(s):
    """Returns a string of input s with 2 spaces prepended"""
    return "  {}".format(s)

and start reading in and computing your values:

data = {}
with open("d.txt","r") as f:
    headers = f.readline().split() # store header line for later
    for line in f:
        if line.strip(): # just a guard against empty lines
            # name, *splitted = line.split() # python 3.x, you specced 2.7
            tmp = line.split()
            name = tmp[0]
            splitted = tmp[1:]
            nums = list(map(float,splitted))
            data.setdefault((name,nums[3]),[]).append(nums)
printMe(data)

# sort data
for nameAdd1 in data:
    # name     :  count   count1  count3  add1    add2 
    data[nameAdd1].sort(key = lambda x: -x[4]) # - "trick" to sort descending, you 
                                               # could use reverse=True instead 
printMe(data)


# calculate stuff and store in result
result = {}
for nameAdd1 in data:
    try:
        values = zip(*data[nameAdd1])

        # this results in value error if you can not decompose in r1,r2
        result[nameAdd1] = [r1 / (r1+r2) for r1,r2 in values]

    except ValueError:
        # this catches the case of only 1 value for a person 
        result[nameAdd1] = data[nameAdd1][0]
printMe(result)


# store as resultfile (will be overwritten each time)
with open("d2.txt","w") as f:
    # header
    f.write(headers[0])
    for h in headers[1:]:
        f.write(spaceEm(h))
    f.write("\n")

    # data
    for key in result:
        f.write(key[0]) # name
        for t in map(spaceEm,result[key]):
            f.write(t) # numbers
        f.write("\n")

Output:

# read from file
('jack', 100174766.0)   :   [[70.0, 55.0, 31.0, 100174766.0, 100170715.0], [45.0, 656.0, 48.0, 100174766.0, 100174052.0]]
('david', 103361093.0)  :   [[10.0, 56.0, 78.0, 103361093.0, 103368592.0]]
('john', 102268764.0)   :   [[41.0, 22.0, 89.0, 102268764.0, 102267805.0], [47.0, 31.0, 63.0, 102268764.0, 102267908.0]]

# sorted by add2 (descending)
('jack', 100174766.0)   :   [[45.0, 656.0, 48.0, 100174766.0, 100174052.0], [70.0, 55.0, 31.0, 100174766.0, 100170715.0]]
('david', 103361093.0)  :   [[10.0, 56.0, 78.0, 103361093.0, 103368592.0]]
('john', 102268764.0)   :   [[47.0, 31.0, 63.0, 102268764.0, 102267908.0], [41.0, 22.0, 89.0, 102268764.0, 102267805.0]]

# result of calculations
('jack', 100174766.0)   :   [0.391304347826087, 0.9226441631504922, 0.6075949367088608, 0.5, 0.5000083281436545]
('david', 103361093.0)  :   [10.0, 56.0, 78.0, 103361093.0, 103368592.0]
('john', 102268764.0)   :   [0.5340909090909091, 0.5849056603773585, 0.4144736842105263, 0.5, 0.5000002517897694]

Input file:

name    count   count1  count3  add1    add2
jack    70  55  31  100174766   100170715
jack    45  656 48  100174766   100174052
john    41  22  89  102268764   102267805
john    47  31  63  102268764   102267908
david   10  56  78  103361093   103368592

Output file:

name  count  count1  count3  add1  add2
jack  0.391304347826087  0.9226441631504922  0.6075949367088608  0.5  0.5000083281436545
john  0.5340909090909091  0.5849056603773585  0.4144736842105263  0.5  0.5000002517897694
david  10.0  56.0  78.0  103361093.0  103368592.0

Disclaimer: I coded in 3.x and fixed it to 2.7 in http://pyfiddle.io afterwards, there might be some "unneeded" intermediary variables to make it work...
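
For readers on Python 3, a minimal sketch of the same idea follows. It is not the answer's original 3.x code, which was not posted; `ratios_by_add1` and `SAMPLE` are hypothetical names. The sketch also assumes the asker's desired output: groups whose add1 occurs only once (david) are skipped, and add1/add2 are carried over from the row with the greater add2 instead of being divided.

```python
from collections import defaultdict

# Sample data copied from the question (whitespace-separated).
SAMPLE = """name count count1 count3 add1 add2
jack 70 55 31 100174766 100170715
jack 45 656 48 100174766 100174052
john 41 22 89 102268764 102267805
john 47 31 63 102268764 102267908
david 10 56 78 103361093 103368592"""


def ratios_by_add1(lines):
    """Return (name, ratios, add1, add2) for every (name, add1) pair of rows."""
    groups = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # guard against empty lines
        name, *fields = line.split()
        counts = [float(x) for x in fields[:3]]
        add1, add2 = int(fields[3]), int(fields[4])
        groups[(name, add1)].append((add2, counts))

    result = []
    for (name, add1), rows in groups.items():
        if len(rows) != 2:
            continue  # condition A: add1 must occur exactly twice
        # condition B: row1 is the row with the greater add2
        (add2_hi, row1), (_, row2) = sorted(rows, key=lambda r: r[0], reverse=True)
        ratios = [a / (a + b) for a, b in zip(row1, row2)]
        result.append((name, ratios, add1, add2_hi))
    return result


header, *body = SAMPLE.strip().splitlines()
print("\t".join(header.split()))
for name, ratios, add1, add2 in ratios_by_add1(body):
    print(name, *(f"{r:.9f}" for r in ratios), add1, add2, sep="\t")
```

A pair whose two counts sum to zero would raise `ZeroDivisionError`; the sketch does not handle that case.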

That works perfectly. Do you still have the py3 version of it? I should actually be using py3, since py27 is fading out. — novicebioinforesearcher, Jun 16 '18 at 14:19
also i was unaware of https://pyfiddle.io this is great for me to ask questions.. — novicebioinforesearcher, Jun 16 '18 at 14:20
@novicebioinforesearcher The Python 3 code used the `name, *splitted = line.split()` decomposition (it is still in the code, just commented out). You would have to change the `print` statements to use `( .... )`, as print is a function in Python 3.x. Those were the major modifications. `map` and `zip` as used in this code needed no changes (they are iterated over only once), so there is no need to wrap them in `list` (they return lists in 2.7 and lazy iterators in 3.x, so you need to put them inside lists if you want to use them multiple times). — Patrick Artner, Jun 16 '18 at 14:23
@novicebioinforesearcher As for pyfiddle.io: just make sure to switch between 2.7 and 3.6 correctly, and familiarize yourself with its quirks regarding parameter passing and input() value passing (for example, you need to create files inside the code first to operate on files, and you need to put values in the input fields first). If you can choose between 2.7 and 3.6, I would use 3.6: 2.7 is going out of business in 2020 and almost all things are ported, so there is no need to stay 2.x-ish. — Patrick Artner, Jun 16 '18 at 14:27
Aah, gotcha. I think I did not get this part: `so no need to use list around them (they return lists in 2.7 and generators in 3.x`. Could you please add a comment about it in the answer? — novicebioinforesearcher, Jun 16 '18 at 14:29
@novicebioinforesearcher read up in [this question](https://stackoverflow.com/questions/13638898/how-to-use-filter-map-and-reduce-in-python-3) that showcases some of the difference of certain built-ins between usage in 2.x and 3.x — Patrick Artner, Jun 16 '18 at 14:40
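
To illustrate the point about `map` and `zip` discussed in the comments above, here is a small Python 3 sketch (the variable names are illustrative only). In 3.x both built-ins return one-shot iterators, so a second loop over the same object finds nothing left:

```python
# Python 3: zip() returns a one-shot iterator, not a list.
pairs = zip([45, 656, 48], [70, 55, 31])
first_pass = [a / (a + b) for a, b in pairs]    # consumes the iterator
second_pass = [a / (a + b) for a, b in pairs]   # iterator is exhausted: []

# Wrap in list() when the pairs are needed more than once.
reusable = list(zip([45, 656, 48], [70, 55, 31]))
print(len(first_pass), len(second_pass), len(reusable))  # prints: 3 0 3
```

The answer's code iterates over each `zip`/`map` result exactly once, which is why it behaves the same in 2.7 and 3.x without the extra `list(...)`.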
