4

I have a data set of the form enter image description here

that I would like to change to this form below in R using SQL. enter image description here

I know that I could do this daily simply with dplyr but the point here is to learn to use SQL to create and manipulate a small relational database.

  • Price needs to be turned into a numeric value. Removing the "R" and spaces in between.

  • coordinates needs to be turned into 2 coordinates Long and Lat

  • floor size needs to be turned into a numeric from a string removing the space and "m^2" at the end.

Minimum working example

# Data to copy into sheet

       Price                            coordinates floor.size surburb       date
 R 1 750 000 -33.93082074573843, 18.857342125467635      68 m²     Jhb 2021-06-24
 R 1 250 000 -33.930077157927855, 18.85420954236195      56 m²     Jhb 2021-06-17
 R 2 520 000 -33.92954929205658, 18.857504799977896      62 m²     Jhb 2021-06-24

Code to manipulate in R markdown

```{r}
#install.packages("RSQLite", repos = "http://cran.us.r-project.org")

library(readxl)
library(dplyr)
library(RSQLite)
library(DBI)
library(knitr)

db <- dbConnect(RSQLite::SQLite(), ":memory:")

knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(connection = "db")


# Import data
dataH <- read_excel("C:/Users/Dell/Desktop/exampledata.xlsx")

``` 

```{sql, connection = db}
# SQL code passed directly
```

Edit 1:

The answer by @Onyambu works almost. It is producing an error with the coordinates. For example in the image below the last two coordinates are supposed to have a Long that starts with '18.85' instead of '.85' when the coordinate was "-33.930989501123, 18.857270308516927". How would I fix this?

enter image description here

user849541
  • 176
  • 8
  • Have you looked at [`sqldf`](https://cran.r-project.org/web/packages/sqldf/index.html)? It allows SQL queries against a `data.frame`(s). – r2evans Jun 27 '21 at 19:04
  • Please post your data using `dput(x)`, since as-is we cannot just copy and paste without a bit of manual extract. (The embedded spaces make it so that `read.table` and family cannot just parse it.) – r2evans Jun 27 '21 at 19:06
  • Also you need to specificly state what db engine you are using. mysql, sqllite, postgresql etc all have different functions to be implemented. – Onyambu Jun 27 '21 at 19:58

2 Answers2

2

Using the basic sql functions, you could do:

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,connection = "db")
```

```{r}
db <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

txt <- "Price coordinates floor.size surburb date\n
     'R 1 750 000' '-33.93082074573843, 18.857342125467635' '68 m²' Jhb 2021-06-24\n
     'R 1 250 000' '-33.930077157927855, 18.85420954236195' '56 m²' Jhb 2021-06-17\n
     'R 2 520 000' '-33.92954929205658, 18.857504799977896' '62 m²' Jhb 2021-06-24"

dataH <- read.table(text = txt, header = TRUE) 
DBI::dbWriteTable(db, 'dataH', dataH)
```


```{sql}
SELECT REPLACE(SUBSTRING(price, 3, 100), ' ', '') price,
       replace(SUBSTRING(coordinates, 1, 20), ',', '') Lat,
       SUBSTRING(coordinates, 21, 255) Long,
       SUBSTRING(`floor.size`, 1, 2) floor_size,
       surburb,
       date
FROM dataH
```
Onyambu
  • 67,392
  • 3
  • 24
  • 53
  • The actual data is much bigger than 4 rows and has more columns than those that I specified. Could this be written in such a way that I import the data as an R object and then perform the manipulations as I go along(manipulating columns)? @Onyambu – user849541 Jun 27 '21 at 22:18
  • @user849541 i am not only using the 4 rows but everything. You need the SQL CODE. The R code was just for data replication. On your code, once you have imported the data as dataH, then do `dbWriteTable` as in my code above, followed by the sql code above – Onyambu Jun 27 '21 at 22:25
  • I am getting the following error: Error in (function (sql, outputFile, options) : The 'connection' option (DBI connection) is required for sql chunks. – user849541 Jun 27 '21 at 23:02
  • 2
    For last chunk, use: `{sql, connection = db}` – Parfait Jun 27 '21 at 23:09
  • 1
    Hi @Onyambu, there is something that is giving issues with the coordinates. I added it into the question as an edit. – user849541 Jun 27 '21 at 23:23
  • 1
    Perhaps @Parfait might know how to fix the issue with the coordinates transformation? – user849541 Jun 27 '21 at 23:28
  • 1
    Could it be that the decimals are not the same length per say @Onyambu – user4933 Jun 27 '21 at 23:36
  • 1
    @Parfait I placed `connection = db` within the setup chunck, so not really necessary to include it again, but no problem if included – Onyambu Jun 28 '21 at 03:47
  • @user849541 Since I do not know your database engine, there is really no way to use `REGEX` . If your engine supports regular expression, then you could use functions like `regexp_replace` etc – Onyambu Jun 28 '21 at 03:53
  • I am a bit confused by what you mean by this "REGEX"? – user849541 Jun 28 '21 at 08:26
  • @user849541 REGEX simply means regular expressions. It can be used to extract patterns from a string. – Onyambu Jun 28 '21 at 15:57
1

You can use charindex and substr to do what you need. I'll demo with sqldf, which is using SQLite's engine under the hood. (This query is very similar to Onyambu's but solves one issue with text selection.)

dat <- structure(list(Price = c("R 1 750 000", "R 1 250 000", "R 2 520 000"), coordinates = c("-33.93082074573843, 18.857342125467635", "-33.930077157927855, 18.85420954236195", "-33.92954929205658, 18.857504799977896"), floor.size = c("68 m²", "56 m²", "62 m²"), surburb = c("Jhb", "Jhb", "Jhb"), date = c("2021-06-24", "2021-06-17", "2021-06-24")), class = "data.frame", row.names = c(NA, -3L))

out <- sqldf::sqldf(
  "select cast(replace(substr(price,2,99),' ','') as real) as price,
          cast(substr(coordinates,1,charindex(',',coordinates)-1) as real) as lat,
          cast(substr(coordinates,charindex(',',coordinates)+1,99) as real) as long,
          cast(substr([floor.size],1,charindex('m',[floor.size])-1) as real) as [floor.size]
   from dat", method = "raw")

out
#     price       lat     long floor.size
# 1 1750000 -33.93082 18.85734         68
# 2 1250000 -33.93008 18.85421         56
# 3 2520000 -33.92955 18.85750         62

str(out)
# 'data.frame': 3 obs. of  4 variables:
#  $ price     : num  1750000 1250000 2520000
#  $ lat       : num  -33.9 -33.9 -33.9
#  $ long      : num  18.9 18.9 18.9
#  $ floor.size: num  68 56 62

(The number of digits shown in the out output is due to R's "digits" option, those are class numeric as shown in the str output.)

You can shorten that and remove all cast(.. as ..) if you change to sqldf(.., method="numeric").

out <- sqldf::sqldf(
  "select replace(substr(price,2,99),' ','') as price,
          substr(coordinates,1,charindex(',',coordinates)-1) as lat,
          substr(coordinates,charindex(',',coordinates)+1,99) as long,
          substr([floor.size],1,charindex('m',[floor.size])-1) as [floor.size]
   from dat", method = "numeric")
r2evans
  • 141,215
  • 6
  • 77
  • 149