1

I have a MySQL database with many large tables in the following format:

mysql> select * from Table1 limit 2;
+-------+----------+-------------+
| chrom | site     | methylation |
+-------+----------+-------------+
|     1 | 10003581 |          76 |
|     1 | 10003584 |           0 |
+-------+----------+-------------+

I would like to create one large merged table in R that will contain all the sites covered with the methylation values for each table. For instance, if I had 4 mysql tables the R data frame would contain the following columns:

chrom    site    table1    table2    table3    table4

So far I have:

library(RMySQL)

#Open database
mydb = dbConnect(MySQL(), user='root', password='', dbname='DataBase')

#Create function to get values
GetVal <- function(TableName, ColumnName){
  rs = dbSendQuery(mydb, paste("SELECT chrom, site, methylation FROM ", TableName))
  data = fetch(rs, n=-1)
  res <- rename(data, c("chrom" = "Chr", "site" = "start", "methylation" = ColumnName))
  return(res)
}

Table1 <- GetVal("Table1", "Table1")
Table2 <- GetVal("Table2", "Table2")
Table3 <- GetVal("Table3", "Table3")
Table4 <- GetVal("Table4", "Table4")

I'd then merge all the tables together. However I figure there should be a faster and more efficient way of doing this.

Brian Tompsett - 汤莱恩
  • 5,753
  • 72
  • 57
  • 129
user2165857
  • 2,530
  • 7
  • 27
  • 39

2 Answers2

2

This would be more general, assuming that the number of tables you're dealing with is variable. It also renames the columns the way that you've wanted in the original function:

library(RMySQL)

##  Open database:
mydb = dbConnect(MySQL(), user='root', password='', dbname='DataBase')

##  Create function to get values:
GetVals <- function(TableNames) {
    query <- paste0("SELECT ", Tables[1], ".Chr AS chrom, ", Tables[1], ".start AS site, ")
    query <- paste0(query, paste0(Tables, ".methylation AS ", Tables, collapse=", "))
    query <- paste0(query, " FROM ", Tables[1], paste0(" JOIN ", Tables[-1], " ON ", Tables[1], ".Chr=", Tables[-1], ".Chr AND ", Tables[1], ".start=", Tables[-1], ".start", collapse=""))

  rs <- dbSendQuery(mydb, query)
  data <- fetch(rs, n=-1)
  return(data)
}

Tables <- c("Table1", "Table2", "Table3", "Table4")

my_data <- GetVals(Tables)

This is the query produced for the Tables variable above:

> query
[1] "SELECT Table1.Chr AS chrom, Table1.start AS site, Table1.methylation AS Table1, Table2.methylation AS Table2, Table3.methylation AS Table3, Table4.methylation AS Table4 FROM Table1 JOIN Table2 ON Table1.Chr=Table2.Chr AND Table1.start=Table2.start JOIN Table3 ON Table1.Chr=Table3.Chr AND Table1.start=Table3.start JOIN Table4 ON Table1.Chr=Table4.Chr AND Table1.start=Table4.start"
Forrest R. Stevens
  • 3,435
  • 13
  • 21
0

Try this

dbSendQuery(mydb, 'insert into chrom_sites
select distinct chrom,site from table1
union
select distinct chrom,site from table2
union
select distinct chrom,site from table3
union
select distinct chrom,site from table4
union
select distinct chrom,site from table5')

x <- dbSendQuery(mydb, 'select chrom,
site,
t1.methylation as table1,
t2.methylation as table2,
t3.methylation as table3,
t4.methylation as table4,
t5.methalation as table5
from chrom_sites as a
join table1 as t1 on a.chrom = t1.chrom and a.site = t1.site
join table2 as t2 on a.chrom = t2.chrom and a.site = t2.site
join table3 as t3 on a.chrom = t3.chrom and a.site = t3.site
join table4 as t4 on a.chrom = t4.chrom and a.site = t4.site
join table5 as t5 on a.chrom = t5.chrom and a.site = t5.site')

What this should do is create a chrom_sites table in MySQL that contains the unique values of chrom and site.

After that, it uses that as a starting point to then populate the table (data frame) in the manner you want.

There might be a better way to do the first part, but I'm not sure. If you have a lot of tables, it might make sense to write a function to do this.

Ken Yeoh
  • 876
  • 6
  • 11