1

I'm given some data in an .rData file. The format is an xts object in character mode. (I realise this is an unusual format, but I have no control over it)

> head(trades)
                    SYMBOL EX  PRICE    SIZE    COND BID    BIDSIZ   OFR   
2012-05-04 09:30:00 "BAC"  "T" "7.89"   "38538" "F"  "7.89" "523"    "7.9" 
2012-05-04 09:30:01 "BAC"  "Z" "7.885"  "288"   "@"  "7.88" "61033"  "7.9" 
2012-05-04 09:30:03 "BAC"  "X" "7.89"   "1000"  "@"  "7.88" "1974"   "7.89"
2012-05-04 09:30:07 "BAC"  "T" "7.89"   "19052" "F"  "7.88" "1058"   "7.89"
2012-05-04 09:30:08 "BAC"  "Y" "7.89"   "85053" "F"  "7.88" "108101" "7.9" 
2012-05-04 09:30:09 "BAC"  "D" "7.8901" "10219" "@"  "7.89" "268"    "7.9" 

> mode(trades)
'character'

I would like to process this data by converting into a saner format, namely a tibble, so that I can store the columns as datetimes, doubles and integers.

I've managed to achieve this with the following code:

> trades_ =  bind_cols(data_frame(DATE=index(trades)), as_data_frame(coredata(trades))) %>%
    mutate_at(as.numeric, .cols=vars(PRICE, BID, OFR)) %>%
    mutate_at(as.integer, .cols=vars(SIZE, BIDSIZ, OFRSIZ))

> head(trades_)
# A tibble: 6 × 10
                 DATE SYMBOL    EX  PRICE  SIZE  COND   BID BIDSIZ   OFR
               <dttm>  <chr> <chr>  <dbl> <int> <chr> <dbl>  <int> <dbl>
1 2012-05-04 09:30:00    BAC     T 7.8900 38538     F  7.89    523  7.90   
2 2012-05-04 09:30:01    BAC     Z 7.8850   288     @  7.88  61033  7.90
3 2012-05-04 09:30:03    BAC     X 7.8900  1000     @  7.88   1974  7.89   
4 2012-05-04 09:30:07    BAC     T 7.8900 19052     F  7.88   1058  7.89   
5 2012-05-04 09:30:08    BAC     Y 7.8900 85053     F  7.88 108101  7.90
6 2012-05-04 09:30:09    BAC     D 7.8901 10219     @  7.89    268  7.90

I'm wondering if there's already a built-in function for this. Something that looks at each column of the trades matrix and figures out whether it's a column of integers, doubles, etc and converts it to the appropriate type.

This is the sort of thing that a csv parser would do.

user357269
  • 1,835
  • 14
  • 40

2 Answers2

1

This is far from an authoritative answer, but I ended up doing this:

smarter_type_convert = function (vector) {
    converted_vector = type.convert(vector)
    if (is.numeric(converted_vector)) {
        int_vector = as.integer(converted_vector)
        if (isTRUE(all.equal(int_vector, converted_vector, check.attributes=FALSE))) {
            int_vector
        } else {
            converted_vector
        }
    } else {
        converted_vector
    }
}

trades %>% coredata %>% as_data_frame %>% mutate_all(smarter_type_convert)
user357269
  • 1,835
  • 14
  • 40
0

You're right that a data frame is the right approach since you are working with multiple classes in the same frame. xts doesn't allow multiple classes so coercion rules force you to work with characters instead of numbers.

Here's a solution so you don't have to specifically call out each column. I use the tidyquant package which is made for working with quantitative data within the "tidyverse" (i.e. using "tidy" data frames). It also has some nice functions to convert to and from xts, matrix and other time series classes that contain row names.

First, I recreate the data.

> trades_xts
                    SYMBOL EX  PRICE    SIZE    COND BID    BIDSIZ   OFR    
2012-05-04 09:30:00 "BAC"  "T" "7.8900" "38538" "F"  "7.89" "523"    "7.90" 
2012-05-04 09:30:01 "BAC"  "Z" "7.8850" "288"   "@"  "7.88" "61033"  "7.90" 
2012-05-04 09:30:03 "BAC"  "X" "7.8900" "1000"  "@"  "7.88" "1974"   "7.89" 
2012-05-04 09:30:07 "BAC"  "T" "7.8900" "19052" "F"  "7.88" "1058"   "7.89"
2012-05-04 09:30:08 "BAC"  "Y" "7.8900" "85053" "F"  "7.88" "108101" "7.90" 
2012-05-04 09:30:09 "BAC"  "D" "7.8901" "10219" "@"  "7.89" "268"    "7.90" 

Next, I use tidy functions to clean the data. It's a little longer than your script, but you don't need to worry about which columns have which data type (with the exception of the xts index). Note that I'm using the tidyquant::as_tibble() functions to convert the xts row names to a column. I use mutate_each to apply the type.convert function to each column. Unfortunately base R loves the factor class so I add an additional step to convert to character. The last two steps just clean up the date-time column using dplyr::rename and lubridate::as_datetime, which tidyquant loads for you.

> library(tidyquant)
> trades_xts %>%
    as_tibble(preserve_row_names = TRUE) %>%
    mutate_each(funs(type.convert)) %>%
    mutate_if(is.factor, as.character) %>%
    rename(DATE = row.names) %>%
    mutate(DATE = as_datetime(DATE, tz = Sys.timezone())) 
    # A tibble: 6 × 9
                 DATE SYMBOL    EX  PRICE  SIZE  COND   BID BIDSIZ   OFR
               <dttm>  <chr> <chr>  <dbl> <int> <chr> <dbl>  <int> <dbl>
1 2012-05-04 09:30:00    BAC     T 7.8900 38538     F  7.89    523  7.90
2 2012-05-04 09:30:01    BAC     Z 7.8850   288     @  7.88  61033  7.90
3 2012-05-04 09:30:03    BAC     X 7.8900  1000     @  7.88   1974  7.89
4 2012-05-04 09:30:07    BAC     T 7.8900 19052     F  7.88   1058  7.89
5 2012-05-04 09:30:08    BAC     Y 7.8900 85053     F  7.88 108101  7.90
6 2012-05-04 09:30:09    BAC     D 7.8901 10219     @  7.89    268  7.90
Matt Dancho
  • 6,840
  • 3
  • 35
  • 26
  • `type.convert` has `as.is` argument. Setting it to `TRUE` prevent conversion to factor. – Marek Sep 15 '20 at 14:49