0

screenshotI have a table. The first row is id and the second row is value. The table has many rows. I want to have only unique id with one value. If the same id has both negative and positive value, I would select the positive value (1) and delete the negative value (0). See screenshot. I would be happy if you give me any solutions no matter of using R, Python or Postgresql. Thanks for your help!

hadm_id rass_v
100001    0
100003    0
100003    1
100006    0
100006    1
100007    0
100007    1
100009    0
100009    1
100010    0
100010    1
100011    0
100011    1
lancet
  • 33
  • 4

5 Answers5

1

in R

aggregate(rass_v~hadm_id, df, max)
#   hadm_id rass_v
# 1  100001      0
# 2  100003      1
# 3  100006      1
# 4  100007      1
# 5  100009      1
# 6  100010      1
# 7  100011      1
moodymudskipper
  • 46,417
  • 11
  • 121
  • 167
  • thanks! what this ~ means between rass_v and hadm_id? Is it similar to this answer https://stackoverflow.com/questions/14976331/use-of-tilde-in-r-programming-language? – lancet Feb 13 '19 at 21:42
  • Yes, it builds an object of class formula, the answer you linked focuses on regression but there are other uses. The function `aggregate` when provided with a formula as a first argument, reads it as "aggregate lhs by rhs" where you can use `+` on rhs to aggregate by more variables. – moodymudskipper Feb 14 '19 at 07:14
  • See: stackoverflow.com/questions/1660124/how-to-sum-a-variable-by-group/1661144 – moodymudskipper Feb 14 '19 at 07:26
  • Definitely the preferred way, mine was unnecessarily complicated. – ismirsehregal Feb 14 '19 at 09:50
0

You can use pandas for this:

import pandas as pd

d = {'hadmid': [1, 1, 2], 'rass_v': [3, -3, 4]}
df = pd.DataFrame(data=d)

boolean_mask = df.rass_v > 0
df = df[boolean_mask]
Gustav Rasmussen
  • 3,720
  • 4
  • 23
  • 53
  • Thanks Gustav! I have more than 70000 rows. Do In need to put all the id number in the iD array? – lancet Feb 13 '19 at 08:08
  • Let me clarify: the d object here is just a dictionary i manually created in order to demonstrate the functionality. You would instead of loading a dict object into the dataframe, directly load your table with a pandas read function (probably pd.read_table, or pd.read_csv) – Gustav Rasmussen Feb 13 '19 at 08:10
  • This will return a dataframe object without negative values. To completely eliminate duplicate id's you might want to also take the max value for each unique id, or similar. – Gustav Rasmussen Feb 13 '19 at 08:12
  • Thanks! I believe the max value is the right solution. Thank you very much! – lancet Feb 13 '19 at 08:16
0

Welcome to SO!

Here is one of several possibilities using R:

df <- data.frame(
     hadm_id = c(100001, 100003, 100003, 100006, 100006, 100007, 100007,
                 100009, 100009, 100010, 100010, 100011, 100011),
      rass_v = c(0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1)
)
# Edit: for better readability please use @Moody_Mudskipper's answer:    
# df <- setNames(aggregate(df$rass_v, by = list(df$hadm_id), max), names(df))
df <- aggregate(rass_v~hadm_id, df, max)

print(df)

See this for more.

Here is a faster data.table solution (for bigger tables):

library(data.table)
DT <- data.table(
     hadm_id = c(100001, 100003, 100003, 100006, 100006, 100007, 100007,
                 100009, 100009, 100010, 100010, 100011, 100011),
      rass_v = c(0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1)
)

DT <- DT[DT[, .I[which.max(rass_v)], by=hadm_id]$V1]
print(DT)

Please see this related question and Arun's answer.

Result:

   hadm_id rass_v
1:  100001      0
2:  100003      1
3:  100006      1
4:  100007      1
5:  100009      1
6:  100010      1
7:  100011      1

Edit: Here is the equivalent pandas way:

import pandas as pd
df = pd.DataFrame({'hadmid': [100001, 100003, 100003, 100006, 100006, 100007, 100007,
                 100009, 100009, 100010, 100010, 100011, 100011],
                 'rass_v': [0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]})

df = df.groupby(['hadmid'], sort=False)['rass_v'].max()

print(df)
ismirsehregal
  • 30,045
  • 5
  • 31
  • 78
0

I checked this and answered correctly.

DELETE FROM 'table_name' WHERE 'table_name'.hadm_id IN (
    select hadm_id from (
        SELECT count(hadm_id) c,
        hadm_id
        FROM 'table_name'
        GROUP BY hadm_id
        ) foo
    where foo.c = 2
    )
and rass_v=0
0

You can use tidyverse to easily get the result by grouping and then filtering by condition to leave only rows that aren't 0 cases where n = 1.

df %>% 
  group_by(hadm_id) %>%
  filter((n() > 1 & rass_v != 0) | (n() == 1))

Result:

print(df)

# A tibble: 7 x 2
# Groups:   hadm_id [7]
  hadm_id rass_v
    <dbl>  <dbl>
1  100001      0
2  100003      1
3  100006      1
4  100007      1
5  100009      1
6  100010      1
7  100011      1
nycrefugee
  • 1,629
  • 1
  • 10
  • 23