How can I convert a matrix of strings into a tibble?

Question

I'm given some data in an .rData file. The format is an xts object in character mode. (I realise this is an unusual format, but I have no control over it)

> head(trades)
                    SYMBOL EX  PRICE    SIZE    COND BID    BIDSIZ   OFR   
2012-05-04 09:30:00 "BAC"  "T" "7.89"   "38538" "F"  "7.89" "523"    "7.9" 
2012-05-04 09:30:01 "BAC"  "Z" "7.885"  "288"   "@"  "7.88" "61033"  "7.9" 
2012-05-04 09:30:03 "BAC"  "X" "7.89"   "1000"  "@"  "7.88" "1974"   "7.89"
2012-05-04 09:30:07 "BAC"  "T" "7.89"   "19052" "F"  "7.88" "1058"   "7.89"
2012-05-04 09:30:08 "BAC"  "Y" "7.89"   "85053" "F"  "7.88" "108101" "7.9" 
2012-05-04 09:30:09 "BAC"  "D" "7.8901" "10219" "@"  "7.89" "268"    "7.9" 

> mode(trades)
'character'

I would like to process this data by converting into a saner format, namely a tibble, so that I can store the columns as datetimes, doubles and integers.

I've managed to achieve this with the following code:

> trades_ =  bind_cols(data_frame(DATE=index(trades)), as_data_frame(coredata(trades))) %>%
    mutate_at(as.numeric, .cols=vars(PRICE, BID, OFR)) %>%
    mutate_at(as.integer, .cols=vars(SIZE, BIDSIZ, OFRSIZ))

> head(trades_)
# A tibble: 6 × 10
                 DATE SYMBOL    EX  PRICE  SIZE  COND   BID BIDSIZ   OFR
               <dttm>  <chr> <chr>  <dbl> <int> <chr> <dbl>  <int> <dbl>
1 2012-05-04 09:30:00    BAC     T 7.8900 38538     F  7.89    523  7.90   
2 2012-05-04 09:30:01    BAC     Z 7.8850   288     @  7.88  61033  7.90
3 2012-05-04 09:30:03    BAC     X 7.8900  1000     @  7.88   1974  7.89   
4 2012-05-04 09:30:07    BAC     T 7.8900 19052     F  7.88   1058  7.89   
5 2012-05-04 09:30:08    BAC     Y 7.8900 85053     F  7.88 108101  7.90
6 2012-05-04 09:30:09    BAC     D 7.8901 10219     @  7.89    268  7.90

I'm wondering if there's already a built-in function for this. Something that looks at each column of the trades matrix and figures out whether it's a column of integers, doubles, etc and converts it to the appropriate type.

This is the sort of thing that a csv parser would do.

`?type.convert`, which is as you guessed, used as a helper function for `read.csv` `read.table` etc. — thelatemail, Feb 27 '17 at 00:13
You're probably better off explicitly setting the types like this. Tracking down bugs when the computer guesses wrong can be a pain. — Hong Ooi, Feb 27 '17 at 00:19
@HongOoi: I think you might be right, `type.convert` didn't like `"1e+05"` in the `SIZE` column... — user357269, Feb 27 '17 at 00:24
@user357269 - are you sure about that? `type.convert(c("1000","1.25","1e1","1e+05"))` returns a numeric vector for me. — thelatemail, Feb 27 '17 at 00:44
@thelatemail, yeah, sorry if I was unclear, my problem is that `SIZE` ought to be an integer — user357269, Feb 27 '17 at 00:46

score 1 · Accepted Answer · answered Feb 27 '17 at 02:13

This is far from an authoritative answer, but I ended up doing this:

smarter_type_convert = function (vector) {
    converted_vector = type.convert(vector)
    if (is.numeric(converted_vector)) {
        int_vector = as.integer(converted_vector)
        if (isTRUE(all.equal(int_vector, converted_vector, check.attributes=FALSE))) {
            int_vector
        } else {
            converted_vector
        }
    } else {
        converted_vector
    }
}

trades %>% coredata %>% as_data_frame %>% mutate_all(smarter_type_convert)

score 0 · Answer 2 · answered Mar 02 '17 at 18:52

You're right that a data frame is the right approach since you are working with multiple classes in the same frame. xts doesn't allow multiple classes so coercion rules force you to work with characters instead of numbers.

Here's a solution so you don't have to specifically call out each column. I use the tidyquant package which is made for working with quantitative data within the "tidyverse" (i.e. using "tidy" data frames). It also has some nice functions to convert to and from xts, matrix and other time series classes that contain row names.

First, I recreate the data.

> trades_xts
                    SYMBOL EX  PRICE    SIZE    COND BID    BIDSIZ   OFR    
2012-05-04 09:30:00 "BAC"  "T" "7.8900" "38538" "F"  "7.89" "523"    "7.90" 
2012-05-04 09:30:01 "BAC"  "Z" "7.8850" "288"   "@"  "7.88" "61033"  "7.90" 
2012-05-04 09:30:03 "BAC"  "X" "7.8900" "1000"  "@"  "7.88" "1974"   "7.89" 
2012-05-04 09:30:07 "BAC"  "T" "7.8900" "19052" "F"  "7.88" "1058"   "7.89"
2012-05-04 09:30:08 "BAC"  "Y" "7.8900" "85053" "F"  "7.88" "108101" "7.90" 
2012-05-04 09:30:09 "BAC"  "D" "7.8901" "10219" "@"  "7.89" "268"    "7.90"

Next, I use tidy functions to clean the data. It's a little longer than your script, but you don't need to worry about which columns have which data type (with the exception of the xts index). Note that I'm using the tidyquant::as_tibble() functions to convert the xts row names to a column. I use mutate_each to apply the type.convert function to each column. Unfortunately base R loves the factor class so I add an additional step to convert to character. The last two steps just clean up the date-time column using dplyr::rename and lubridate::as_datetime, which tidyquant loads for you.

> library(tidyquant)
> trades_xts %>%
    as_tibble(preserve_row_names = TRUE) %>%
    mutate_each(funs(type.convert)) %>%
    mutate_if(is.factor, as.character) %>%
    rename(DATE = row.names) %>%
    mutate(DATE = as_datetime(DATE, tz = Sys.timezone())) 
    # A tibble: 6 × 9
                 DATE SYMBOL    EX  PRICE  SIZE  COND   BID BIDSIZ   OFR
               <dttm>  <chr> <chr>  <dbl> <int> <chr> <dbl>  <int> <dbl>
1 2012-05-04 09:30:00    BAC     T 7.8900 38538     F  7.89    523  7.90
2 2012-05-04 09:30:01    BAC     Z 7.8850   288     @  7.88  61033  7.90
3 2012-05-04 09:30:03    BAC     X 7.8900  1000     @  7.88   1974  7.89
4 2012-05-04 09:30:07    BAC     T 7.8900 19052     F  7.88   1058  7.89
5 2012-05-04 09:30:08    BAC     Y 7.8900 85053     F  7.88 108101  7.90
6 2012-05-04 09:30:09    BAC     D 7.8901 10219     @  7.89    268  7.90

`type.convert` has `as.is` argument. Setting it to `TRUE` prevent conversion to factor. — Marek, Sep 15 '20 at 14:49

How can I convert a matrix of strings into a tibble?

2 Answers2

Linked