3

sqldf has a limit option to get 'X' rows. Can we also do a 'x%' sample using sqldf?

e.g.

> sqldf("select * from iris limit 3")

Loading required package: tcltk
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa

> sqldf("select * from iris sample 0.01")
Error in sqliteSendQuery(con, statement, bind.data) : 
  error in statement: near ".1": syntax error

Is there some workaround for this?

Manish

myloginid
  • 1,463
  • 2
  • 22
  • 37
  • This is a completely valid question, but is there a reason you are using `sqldf` in `R`? Why not pure SQL? Or, if you like, why would another solution in R not be suitable here? – Hugh Sep 03 '15 at 12:47
  • 1
    As the question is a sql question, possible duplicate of [this one](http://stackoverflow.com/questions/5615172/mysql-limit-by-a-percentage-of-the-amount-of-records) – Tensibai Sep 03 '15 at 13:53
  • @Hugh - I come from SQL background. So much more comfortable working in sqldf to alter tables/subselect cols and rows/group bys/havings etc. Hence want to do all that in SQL and then Sample rows. – myloginid Sep 04 '15 at 03:24
  • See example 6e on the sqldf github page: https://github.com/ggrothendieck/sqldf – G. Grothendieck Jan 07 '17 at 11:45

3 Answers3

6

This is basically a SQL question. As sqldf uses SQLite by default (this is not absolutely correct, see the documentation for drv in ?sqldf for more details), the question boils down to "How to select rows in random order in SQLite?". A simple answer can be found in this SO post:

SELECT * FROM table ORDER BY RANDOM() LIMIT 1;

Using the example from the question:

> sqldf("select * from iris order by RANDOM() limit 3")
  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1          6.3         3.3          4.7         1.6 versicolor
2          6.0         2.7          5.1         1.6 versicolor
3          5.0         2.0          3.5         1.0 versicolor

EDIT: If the percentage instead of the number of rows is given, the following could be used as a R solution. Pure SQLite solutions may exist.

percentage <- 0.02
mylimit <- round(nrow(iris) * percentage, 0)
sqldf(sprintf("select * from iris order by RANDOM() limit %d", mylimit))
Community
  • 1
  • 1
CL.
  • 14,577
  • 5
  • 46
  • 73
  • This works fine. Just that when doing a group by or having clauses the nbr of rows can vary so we need 2 steps - one to create the calculated data frame and another to do the x% sampling once we have the rowcount. – myloginid Sep 04 '15 at 03:28
1

Not very elegant (SQL is not my strong point) but it works.

library(sqldf)
totalrows <- sqldf("SELECT COUNT(*) AS NumberOfOrders FROM iris") 

so....

10*totalrows[[1]]/100

get the "number" and put here:

sqldf("SELECT * FROM iris LIMIT number")

In any case, it is a question of sql rather than R

PereG
  • 1,796
  • 2
  • 22
  • 23
0

Using 6e from the sqldf home page we can get 10% of the records randomly from iris using SQL like this:

library(sqldf)

sqldf("select * from iris order by random(*) limit 0.10 * (select count(*) from iris)")

To parameterize by the fraction try this:

frac <- 0.10
fn$sqldf("select * from iris order by random(*) limit $frac * (select count(*) from iris)")
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341