13

How do I search for a string in a data.frame? As a minimal example, how do I find the locations (columns and rows) of 'horse' in this data.frame?

> df = data.frame(animal=c('goat','horse','horse','two', 'five'), level=c('five','one','three',30,'horse'), length=c(10, 20, 30, 'horse', 'eight'))
> df
  animal level length
1   goat  five     10
2  horse   one     20
3  horse three     30
4    two    30  horse
5   five horse  eight

... so row 4 and 5 have the wrong order. Any output that would allow me to identify that 'horse' has shifted to the level column in row 5 and to the length column in row 4 is good. Maybe:

> magic_function(df, 'horse')
col       row
'animal', 2
'animal', 3
'length', 4
'level',  5

Here's what I want to use this for: I have a very large data frame (around 60 columns, 20.000 rows) in which some columns are messed up for some rows. It's too large to eyeball in order to identify the different ways that order can be wrong, so searching would be nice. I will use this info to move data to the correct columns for these rows.

Jonas Lindeløv
  • 5,442
  • 6
  • 31
  • 54
  • What order is correct, and what outcome do you want? – Tim Biegeleisen Sep 12 '16 at 12:11
  • @TimBiegeleisen I updated the question. By "location" I meant columns and rows (a data frame is 2D). I'm not asking about how to change order - that I know how to do. It was just to present the context of my problem. I guess there's no value in pasting my actual 60 columns since everybody would then ask for a minimal example ;-) But in my minimal example, let's say that the order of the first three rows is correct and the data is on the wrong columns in the row 4 and 5. – Jonas Lindeløv Sep 12 '16 at 12:16
  • 1
    re "hat 'horse' has shifted to the level column in row 5": there's no `horse`in row 5 of your example. I quit there. – Tensibai Sep 12 '16 at 12:26
  • @RonakShah, I've updated with an example output. – Jonas Lindeløv Sep 12 '16 at 12:30
  • @Tensibai, thanks for noticing. I've updated the example. – Jonas Lindeløv Sep 12 '16 at 12:30
  • @RonakShah, corrected – Jonas Lindeløv Sep 12 '16 at 12:36

4 Answers4

19

What about:

which(df == "horse", arr.ind = TRUE)
#      row col
# [1,]   2   1
# [2,]   3   1
# [3,]   5   2
# [4,]   4   3
thothal
  • 16,690
  • 3
  • 36
  • 71
  • Nice. Thinking about OP's problem, I would couple that with `count( df$animal)`, called for each column in the dataframe. This returns how many occurrences per level occur in the column, making it easier to detect outliers. To use `count` one has to `library(plyr)` first. – larsen Sep 12 '16 at 12:49
  • 2
    @larsen, `colSums(df=='horse')` is a neater way to do that. – Jonas Lindeløv Sep 12 '16 at 12:52
  • 1
    @Jonas thanks. But I think that would solve a slightly different problem. Maybe I misunderstood the question, but I guess that with malformed data one would not know in advance what column values are misplaced. The technique I proposed aims to detect those wrong values. – larsen Sep 12 '16 at 12:55
7

Another way around:

l <- sapply(colnames(df), function(x) grep("horse", df[,x]))

$animal
[1] 2 3

$level
[1] 5

$length
[1] 4

If you want the output to be matrix:

sapply(l,'[',1:max(lengths(l)))

     animal level length
[1,]      2     5      4
[2,]      3    NA     NA
989
  • 12,579
  • 5
  • 31
  • 53
1

We can get the indices where the value is equal to horse. Divide it by number of rows (nrow) to get the column indices and by columns (ncol) to get the row indices.

We use colnames to get column names instead of indices.

data.frame(col = colnames(df)[floor(which(df == "horse") / (nrow(df) + 1)) + 1], 
           row = floor(which(df == "horse") / ncol(df)) + 1)

#   col   row
#1 animal   1
#2 animal   2
#3  level   4
#4 length   5
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
1

Another way to do it is the following:

library(data.table)
library(zoo)
library(dplyr)
library(timeDate)
library(reshape2)
data frame name = tbl_account

first,Transpose it :

temp = t(tbl_Account)

Then, put it in to a list :

temp = list(temp)

This essentially puts every single observation in a data frame in to one massive string, allowing you to search the whole data frame in one go.

then do the searching :

temp[[1]][grep("Horse",temp[[1]])] #brings back the actual value occurrences
grep("Horse", temp[[1]]) # brings back the position of the element in a list it occurs in 

hope this helps :)

thothal
  • 16,690
  • 3
  • 36
  • 71
piyuw
  • 61
  • 1
  • 6