Select positive value from id when the same id has both positive and negative value

Question

screenshotI have a table. The first row is id and the second row is value. The table has many rows. I want to have only unique id with one value. If the same id has both negative and positive value, I would select the positive value (1) and delete the negative value (0). See screenshot. I would be happy if you give me any solutions no matter of using R, Python or Postgresql. Thanks for your help!

hadm_id rass_v
100001    0
100003    0
100003    1
100006    0
100006    1
100007    0
100007    1
100009    0
100009    1
100010    0
100010    1
100011    0
100011    1

score 1 · Answer 1 · answered Feb 13 '19 at 21:31

1

in R

aggregate(rass_v~hadm_id, df, max)
#   hadm_id rass_v
# 1  100001      0
# 2  100003      1
# 3  100006      1
# 4  100007      1
# 5  100009      1
# 6  100010      1
# 7  100011      1

answered Feb 13 '19 at 21:31

moodymudskipper

46,417
11
121
167

thanks! what this ~ means between rass_v and hadm_id? Is it similar to this answer https://stackoverflow.com/questions/14976331/use-of-tilde-in-r-programming-language? – lancet Feb 13 '19 at 21:42
Yes, it builds an object of class formula, the answer you linked focuses on regression but there are other uses. The function `aggregate` when provided with a formula as a first argument, reads it as "aggregate lhs by rhs" where you can use `+` on rhs to aggregate by more variables. – moodymudskipper Feb 14 '19 at 07:14
See: stackoverflow.com/questions/1660124/how-to-sum-a-variable-by-group/1661144 – moodymudskipper Feb 14 '19 at 07:26
Definitely the preferred way, mine was unnecessarily complicated. – ismirsehregal Feb 14 '19 at 09:50

Gustav Rasmussen · Answer 2 · 2019-02-13T08:08:48.820

0

You can use pandas for this:

import pandas as pd

d = {'hadmid': [1, 1, 2], 'rass_v': [3, -3, 4]}
df = pd.DataFrame(data=d)

boolean_mask = df.rass_v > 0
df = df[boolean_mask]

edited Feb 13 '19 at 08:08

answered Feb 13 '19 at 08:00

Gustav Rasmussen

3,720
4
23
53

Thanks Gustav! I have more than 70000 rows. Do In need to put all the id number in the iD array? – lancet Feb 13 '19 at 08:08
Let me clarify: the d object here is just a dictionary i manually created in order to demonstrate the functionality. You would instead of loading a dict object into the dataframe, directly load your table with a pandas read function (probably pd.read_table, or pd.read_csv) – Gustav Rasmussen Feb 13 '19 at 08:10
This will return a dataframe object without negative values. To completely eliminate duplicate id's you might want to also take the max value for each unique id, or similar. – Gustav Rasmussen Feb 13 '19 at 08:12
Thanks! I believe the max value is the right solution. Thank you very much! – lancet Feb 13 '19 at 08:16

ismirsehregal · Accepted Answer · 2019-02-14T09:50:48.093

Welcome to SO!

Here is one of several possibilities using R:

df <- data.frame(
     hadm_id = c(100001, 100003, 100003, 100006, 100006, 100007, 100007,
                 100009, 100009, 100010, 100010, 100011, 100011),
      rass_v = c(0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1)
)
# Edit: for better readability please use @Moody_Mudskipper's answer:    
# df <- setNames(aggregate(df$rass_v, by = list(df$hadm_id), max), names(df))
df <- aggregate(rass_v~hadm_id, df, max)

print(df)

See this for more.

Here is a faster data.table solution (for bigger tables):

library(data.table)
DT <- data.table(
     hadm_id = c(100001, 100003, 100003, 100006, 100006, 100007, 100007,
                 100009, 100009, 100010, 100010, 100011, 100011),
      rass_v = c(0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1)
)

DT <- DT[DT[, .I[which.max(rass_v)], by=hadm_id]$V1]
print(DT)

Please see this related question and Arun's answer.

Result:

   hadm_id rass_v
1:  100001      0
2:  100003      1
3:  100006      1
4:  100007      1
5:  100009      1
6:  100010      1
7:  100011      1

Edit: Here is the equivalent pandas way:

import pandas as pd
df = pd.DataFrame({'hadmid': [100001, 100003, 100003, 100006, 100006, 100007, 100007,
                 100009, 100009, 100010, 100010, 100011, 100011],
                 'rass_v': [0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]})

df = df.groupby(['hadmid'], sort=False)['rass_v'].max()

print(df)

Mohammad Hossein Shojaeinia · Answer 4 · 2019-02-13T08:47:13.493

0

I checked this and answered correctly.

DELETE FROM 'table_name' WHERE 'table_name'.hadm_id IN (
    select hadm_id from (
        SELECT count(hadm_id) c,
        hadm_id
        FROM 'table_name'
        GROUP BY hadm_id
        ) foo
    where foo.c = 2
    )
and rass_v=0

edited Feb 13 '19 at 08:47

answered Feb 13 '19 at 08:20

Mohammad Hossein Shojaeinia

426
4
13

score 0 · Answer 5 · answered Feb 13 '19 at 13:35

You can use tidyverse to easily get the result by grouping and then filtering by condition to leave only rows that aren't 0 cases where n = 1.

df %>% 
  group_by(hadm_id) %>%
  filter((n() > 1 & rass_v != 0) | (n() == 1))

Result:

print(df)

# A tibble: 7 x 2
# Groups:   hadm_id [7]
  hadm_id rass_v
    <dbl>  <dbl>
1  100001      0
2  100003      1
3  100006      1
4  100007      1
5  100009      1
6  100010      1
7  100011      1

Select positive value from id when the same id has both positive and negative value

5 Answers5