R: Subsetting df based on another column entry

Question

I think this is a simple problem, subsetting data based on the value of another column (here, Patient), but I cannot find the solution. I need a subset containing every patient who has been in the hospital at least 4 times. In other words, only patients with at least 4 hospital visits should appear in the new df, along with all of their visit rows. My table looks like this:

<table class="tg">
  <tr>
    <th class="tg-yw4l">Patient</th>
    <th class="tg-yw4l"># Hospital Visits</th>
    <th class="tg-yw4l">Duration</th>
  </tr>
  <tr>
    <td class="tg-yw4l">Monica</td>
    <td class="tg-yw4l">1</td>
    <td class="tg-yw4l">10D</td>
  </tr>
  <tr>
    <td class="tg-yw4l">Jack</td>
    <td class="tg-yw4l">1</td>
    <td class="tg-yw4l">5D</td>
  </tr>
  <tr>
    <td class="tg-yw4l">Monica</td>
    <td class="tg-yw4l">2</td>
    <td class="tg-yw4l">3D</td>
  </tr>
  <tr>
    <td class="tg-yw4l">Eric</td>
    <td class="tg-yw4l">1</td>
    <td class="tg-yw4l">2D</td>
  </tr>
  <tr>
    <td class="tg-yw4l">Eric</td>
    <td class="tg-yw4l">2</td>
    <td class="tg-yw4l">3D</td>
  </tr>
  <tr>
    <td class="tg-yw4l">Monica</td>
    <td class="tg-yw4l">3</td>
    <td class="tg-yw4l">4D</td>
  </tr>
  <tr>
    <td class="tg-yw4l">Jack</td>
    <td class="tg-yw4l">2</td>
    <td class="tg-yw4l">4D</td>
  </tr>
  <tr>
    <td class="tg-yw4l">Eric</td>
    <td class="tg-yw4l">3</td>
    <td class="tg-yw4l">8D</td>
  </tr>
  <tr>
    <td class="tg-yw4l">Eric</td>
    <td class="tg-yw4l">4</td>
    <td class="tg-yw4l">9D</td>
  </tr>
</table>

Thank you very much!

Wait, is your data an HTML table? Or do you have a proper data.frame in R? See [how to create a reproducible example in R](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) for ways to include data in a question (don't use snippets for R code or data; those are for HTML/javascript) - MrFlick, Mar 15 '17 at 22:22
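
As a minimal sketch of what the comment above asks for, the table from the question could be written as a plain data.frame and shared with `dput()`. The object name `patients` and the column name `Visits` are illustrative assumptions, not names from the original post.

```r
# Sample data from the question, written as a plain data.frame.
# The names `patients` and `Visits` are illustrative, not from the original post.
patients <- data.frame(
  Patient  = c("Monica", "Jack", "Monica", "Eric", "Eric",
               "Monica", "Jack", "Eric", "Eric"),
  Visits   = c(1, 1, 2, 1, 2, 3, 2, 3, 4),
  Duration = c("10D", "5D", "3D", "2D", "3D", "4D", "4D", "8D", "9D"),
  stringsAsFactors = FALSE
)

# dput() prints R code that recreates the object, ready to paste into a question
dput(patients)
```

Pasting the `dput()` output into a question lets others rebuild exactly the same object without parsing HTML.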

Sathish · Answer 1 · 2017-03-15T22:41:21.003

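# doc is the parsed HTML table defined under "Data:" below (requires library(XML))
df1 <- readHTMLTable(doc)[[1]]
colnames( df1 ) <- gsub("# ", '', colnames( df1 ))
# Convert via as.character() so a factor column yields its values, not its level codes
df1$`Hospital Visits` <- as.numeric( as.character( df1$`Hospital Visits` ) )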

df1
#   Patient Hospital Visits Duration
# 1  Monica               1      10D
# 2    Jack               1       5D
# 3  Monica               2       3D
# 4    Eric               1       2D
# 5    Eric               2       3D
# 6  Monica               3       4D
# 7    Jack               2       4D
# 8    Eric               3       8D
# 9    Eric               4       9D

Get only the event where a patient has visited the hospital at least 4 times

with( df1, df1[ `Hospital Visits` >= 4, ] )
#   Patient  Hospital Visits Duration
# 9    Eric                4       9D

Get all events of patients who have visited the hospital at least 4 times

do.call( 'rbind', lapply( split( df1, df1$Patient ), 
                          function( x ) if( any(x$'Hospital Visits' >= 4 ) ) { x }) )

#        Patient Hospital Visits Duration
# Eric.4    Eric               1       2D
# Eric.5    Eric               2       3D
# Eric.8    Eric               3       8D
# Eric.9    Eric               4       9D
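
The same selection can also be sketched without `split()` and `rbind()`, using base R's `ave()` to compute a per-patient maximum. This is an alternative illustration, not part of the original answer; the data frame `patients` and the simplified column name `Visits` are assumptions made so the snippet runs on its own.

```r
# Sample data from the question; `patients` and `Visits` are illustrative names
patients <- data.frame(
  Patient  = c("Monica", "Jack", "Monica", "Eric", "Eric",
               "Monica", "Jack", "Eric", "Eric"),
  Visits   = c(1, 1, 2, 1, 2, 3, 2, 3, 4),
  Duration = c("10D", "5D", "3D", "2D", "3D", "4D", "4D", "8D", "9D"),
  stringsAsFactors = FALSE
)

# ave() returns, for every row, the maximum visit number of that row's patient
max_visits <- ave(patients$Visits, patients$Patient, FUN = max)

# Keep all rows of patients whose maximum visit number is at least 4
frequent <- patients[max_visits >= 4, ]
frequent
```

On the sample data this keeps Eric's four rows and preserves their original row order and row names.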

Data:

library(XML)
doc <- htmlParse('<table class="tg">
                 <tr>
                 <th class="tg-yw4l">Patient</th>
                 <th class="tg-yw4l"># Hospital Visits</th>
                 <th class="tg-yw4l">Duration</th>
                 </tr>
                 <tr>
                 <td class="tg-yw4l">Monica</td>
                 <td class="tg-yw4l">1</td>
                 <td class="tg-yw4l">10D</td>
                 </tr>
                 <tr>
                 <td class="tg-yw4l">Jack</td>
                 <td class="tg-yw4l">1</td>
                 <td class="tg-yw4l">5D</td>
                 </tr>
                 <tr>
                 <td class="tg-yw4l">Monica</td>
                 <td class="tg-yw4l">2</td>
                 <td class="tg-yw4l">3D</td>
                 </tr>
                 <tr>
                 <td class="tg-yw4l">Eric</td>
                 <td class="tg-yw4l">1</td>
                 <td class="tg-yw4l">2D</td>
                 </tr>
                 <tr>
                 <td class="tg-yw4l">Eric</td>
                 <td class="tg-yw4l">2</td>
                 <td class="tg-yw4l">3D</td>
                 </tr>
                 <tr>
                 <td class="tg-yw4l">Monica</td>
                 <td class="tg-yw4l">3</td>
                 <td class="tg-yw4l">4D</td>
                 </tr>
                 <tr>
                 <td class="tg-yw4l">Jack</td>
                 <td class="tg-yw4l">2</td>
                 <td class="tg-yw4l">4D</td>
                 </tr>
                 <tr>
                 <td class="tg-yw4l">Eric</td>
                 <td class="tg-yw4l">3</td>
                 <td class="tg-yw4l">8D</td>
                 </tr>
                 <tr>
                 <td class="tg-yw4l">Eric</td>
                 <td class="tg-yw4l">4</td>
                 <td class="tg-yw4l">9D</td>
                 </tr>
                 </table>')

fleetmack · Answer 2 · 2017-03-15T22:31:00.327

There are countless ways to do this; here is one simple way, though not the most efficient.

Assuming you have this in a data frame, you could pick out the IDs (in this case, names) that have 4 or more visits, then show all records for those names. I am calling your original data frame my_df, with columns name and hospital_visits.

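library(dplyr)
# Names of patients with at least one row showing 4 or more visits
who_to_include <- unique(subset(my_df, hospital_visits >= 4, select = name))
# Keep every record for those patients (R object names cannot start with a digit)
four_or_more <- inner_join(who_to_include, my_df, by = "name")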

Sorry, there was no example to go off, so I'm just winging this code; it may not be 100% right.

score 0 · Answer 3 · answered Mar 15 '17 at 22:43

Assuming you have this in a data frame and the content of column "Patient" specifies a patient uniquely (i.e. there are no multiple Erics), you could also subset it using base R only:

# Logical vector marking rows where the visit number is >= 4
frequentPatientRows <- patientsDf[, "# Hospital Visits"] >= 4
# Extract patient names from those rows
frequentPatientNames <- patientsDf[frequentPatientRows, "Patient"]
# Select all entries for patients with those names
selectedPatients <- patientsDf[patientsDf[, "Patient"] %in% frequentPatientNames, ]
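
To try this approach on the sample data, the data frame has to keep the literal header `# Hospital Visits`. A minimal sketch, assuming the data is built by hand with `check.names = FALSE` and using the column name `Patient` from the question's table:

```r
# check.names = FALSE keeps the literal header "# Hospital Visits";
# by default data.frame() would rewrite it into a syntactically valid name
patientsDf <- data.frame(
  Patient             = c("Monica", "Jack", "Monica", "Eric", "Eric",
                          "Monica", "Jack", "Eric", "Eric"),
  `# Hospital Visits` = c(1, 1, 2, 1, 2, 3, 2, 3, 4),
  Duration            = c("10D", "5D", "3D", "2D", "3D", "4D", "4D", "8D", "9D"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Logical vector: TRUE for rows whose visit number is at least 4
frequentPatientRows <- patientsDf[, "# Hospital Visits"] >= 4
# Names of the patients on those rows
frequentPatientNames <- unique(patientsDf[frequentPatientRows, "Patient"])
# All rows belonging to those patients
selectedPatients <- patientsDf[patientsDf[, "Patient"] %in% frequentPatientNames, ]
selectedPatients
```

If the data was read with default settings, the header would likely have been converted to a different name, so it is worth checking `names(patientsDf)` before indexing by string.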
