
The order of my data is important. If I load a CSV into R using read.csv, is the order of the rows in the dataframe guaranteed to match that of the CSV?

How about if I load a bunch of CSVs and rbind them together and then use subset to get at the data I'm interested in?

For example:

1.csv

foo,bar
a,123
a,456
c,789

2.csv

foo,bar
d,987
a,999
b,654
a,321

Will the following:

data1<-read.csv("1.csv", header=T)
data2<-read.csv("2.csv", header=T)
all_data<-rbind(data1, data2)
filtered<-subset(all_data, foo=="a")

...always produce filtered as:

   foo  bar
1    a  123
2    a  456
3    a  999
4    a  321

...and does this behaviour hold for arbitrary CSV inputs and filters?

Xophmeister
  • My simple answer would be yes to all questions. – Paulo MiraMor Jan 18 '17 at 18:56
  • The reason I ask is because I'm going to be doing the equivalent of a SQL join, but there is no key and so I can only rely on ordering (where the other data is in the same relative order as the first) – Xophmeister Jan 18 '17 at 18:58
  • I understand. R won't change row order when reading or binding. – Paulo MiraMor Jan 18 '17 at 19:00
  • @PauloMiraMor Nor subsetting? – Xophmeister Jan 18 '17 at 19:02
  • Probably a silly question - I'm sure you have your reasons - but why can't you add an ID/key column? – Jan 18 '17 at 19:31
  • @user127649 That's my backup plan, but if R preserves order, then that would be a quicker/easier win – Xophmeister Jan 18 '17 at 19:35
  • Fair enough, I was just curious. Though I would say adding a key would be safer because you could always regain the correct order no matter what manipulations/accidents occurred in the future –  Jan 18 '17 at 19:41
  • @user127649 You're preaching to the choir...but this particular choir is very lazy ;) – Xophmeister Jan 18 '17 at 20:21
  • @user1494080 I'm confused. Do you want a contributor to R's source code to reiterate what Paulo said in the first comment? If you need more convincing you could look at the source code: `read.csv` calls `read.table`, `read.table` calls `scan`, and scan reads a file or a connection and parses it. There's no row shuffling happening anywhere; it would be a huge issue if it did. However, you could have empty lines in your CSV, and those will be skipped if you keep `blank.lines.skip` set to TRUE; in this case your `data.frame` might be shorter than your csv. – moodymudskipper Feb 20 '18 at 13:24
  • @Moody_Mudskipper Thanks. The question is whether it is _guaranteed_ that the order is preserved. Implementations can change. The hope was that there is a note somewhere in the specification/documentation. – user1494080 Feb 20 '18 at 14:20
  • The doc says `Reads a file in table format and creates a data frame from it, with cases corresponding to lines and variables to fields in the file.`. I don't think the `R` doc guarantees anything anywhere, and nobody can be held legally responsible if something goes wrong (I think it's the favorite argument from proprietary software proponents), but your question is a bit like asking if `max` will always return the maximum, if it stops doing so pretty much all the R code of the world will break. As long as you stick to base R functions and major releases you're safe beyond a reasonable doubt. – moodymudskipper Feb 20 '18 at 14:36
  • You can add extra safety by calling explicitly `utils::read.csv`, `base::subset` and `base::rbind` so they can't be masked by a treacherous function. And about `subset` you might want to read this: https://stackoverflow.com/questions/9860090/why-is-better-than-subset – moodymudskipper Feb 20 '18 at 14:40
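
Illustrating the blank.lines.skip caveat from the comments above, here is a minimal sketch (it writes a throwaway CSV to a temporary file; the contents are purely illustrative):

tmp <- tempfile(fileext = ".csv")
writeLines(c("foo,bar", "a,1", "", "b,2"), tmp)  # note the embedded blank line

nrow(read.csv(tmp))                            # 2 -- the blank line is silently dropped
nrow(read.csv(tmp, blank.lines.skip = FALSE))  # 3 -- the blank line becomes an empty/NA row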

3 Answers


Have a read through the source code for read.table. It uses the base function scan, which in turn uses the file and textConnection functions. All of these read the input sequentially, line by line (split on the delimiter), and feed it into the data frame in that order.

function (file, header = FALSE, sep = "", quote = "\"'", dec = ".", 
    numerals = c("allow.loss", "warn.loss", "no.loss"), row.names, 
    col.names, as.is = !stringsAsFactors, na.strings = "NA", 
    colClasses = NA, nrows = -1, skip = 0, check.names = TRUE, 
    fill = !blank.lines.skip, strip.white = FALSE, blank.lines.skip = TRUE, 
    comment.char = "#", allowEscapes = FALSE, flush = FALSE, 
    stringsAsFactors = default.stringsAsFactors(), fileEncoding = "", 
    encoding = "unknown", text, skipNul = FALSE) 
{
    if (missing(file) && !missing(text)) {
        file <- textConnection(text, encoding = "UTF-8")
        encoding <- "UTF-8"
        on.exit(close(file))
    }
    if (is.character(file)) {
        file <- if (nzchar(fileEncoding)) 
            file(file, "rt", encoding = fileEncoding)
        else file(file, "rt")
        on.exit(close(file))
    }
    if (!inherits(file, "connection")) 
        stop("'file' must be a character string or connection")
    if (!isOpen(file, "rt")) {
        open(file, "rt")
        on.exit(close(file))
    }
    pbEncoding <- if (encoding %in% c("", "bytes", "UTF-8")) 
        encoding
    else "bytes"
    numerals <- match.arg(numerals)
    if (skip > 0L) 
        readLines(file, skip)
    nlines <- n0lines <- if (nrows < 0L) 
        5
    else min(5L, (header + nrows))
    lines <- .External(C_readtablehead, file, nlines, comment.char, 
        blank.lines.skip, quote, sep, skipNul)
    if (encoding %in% c("UTF-8", "latin1")) 
        Encoding(lines) <- encoding
    nlines <- length(lines)
    if (!nlines) {
        if (missing(col.names)) 
            stop("no lines available in input")
        rlabp <- FALSE
        cols <- length(col.names)
    }
    else {
        if (all(!nzchar(lines))) 
            stop("empty beginning of file")
        if (nlines < n0lines && file == 0L) {
            pushBack(c(lines, lines, ""), file, encoding = pbEncoding)
            on.exit((clearPushBack(stdin())))
        }
        else pushBack(c(lines, lines), file, encoding = pbEncoding)
        first <- scan(file, what = "", sep = sep, quote = quote, 
            nlines = 1, quiet = TRUE, skip = 0, strip.white = TRUE, 
            blank.lines.skip = blank.lines.skip, comment.char = comment.char, 
            allowEscapes = allowEscapes, encoding = encoding, 
            skipNul = skipNul)
        col1 <- if (missing(col.names)) 
            length(first)
        else length(col.names)
        col <- numeric(nlines - 1L)
        if (nlines > 1L) 
            for (i in seq_along(col)) col[i] <- length(scan(file, 
                what = "", sep = sep, quote = quote, nlines = 1, 
                quiet = TRUE, skip = 0, strip.white = strip.white, 
                blank.lines.skip = blank.lines.skip, comment.char = comment.char, 
                allowEscapes = allowEscapes, encoding = encoding, 
                skipNul = skipNul))
        cols <- max(col1, col)
        rlabp <- (cols - col1) == 1L
        if (rlabp && missing(header)) 
            header <- TRUE
        if (!header) 
            rlabp <- FALSE
        if (header) {
            .External(C_readtablehead, file, 1L, comment.char, 
                blank.lines.skip, quote, sep, skipNul)
            if (missing(col.names)) 
                col.names <- first
            else if (length(first) != length(col.names)) 
                warning("header and 'col.names' are of different lengths")
        }
        else if (missing(col.names)) 
            col.names <- paste0("V", 1L:cols)
        if (length(col.names) + rlabp < cols) 
            stop("more columns than column names")
        if (fill && length(col.names) > cols) 
            cols <- length(col.names)
        if (!fill && cols > 0L && length(col.names) > cols) 
            stop("more column names than columns")
        if (cols == 0L) 
            stop("first five rows are empty: giving up")
    }
    if (check.names) 
        col.names <- make.names(col.names, unique = TRUE)
    if (rlabp) 
        col.names <- c("row.names", col.names)
    nmColClasses <- names(colClasses)
    if (is.null(nmColClasses)) {
        if (length(colClasses) < cols) 
            colClasses <- rep_len(colClasses, cols)
    }
    else {
        tmp <- rep_len(NA_character_, cols)
        names(tmp) <- col.names
        i <- match(nmColClasses, col.names, 0L)
        if (any(i <= 0L)) 
            warning("not all columns named in 'colClasses' exist")
        tmp[i[i > 0L]] <- colClasses[i > 0L]
        colClasses <- tmp
    }
    what <- rep.int(list(""), cols)
    names(what) <- col.names
    colClasses[colClasses %in% c("real", "double")] <- "numeric"
    known <- colClasses %in% c("logical", "integer", "numeric", 
        "complex", "character", "raw")
    what[known] <- sapply(colClasses[known], do.call, list(0))
    what[colClasses %in% "NULL"] <- list(NULL)
    keep <- !sapply(what, is.null)
    data <- scan(file = file, what = what, sep = sep, quote = quote, 
        dec = dec, nmax = nrows, skip = 0, na.strings = na.strings, 
        quiet = TRUE, fill = fill, strip.white = strip.white, 
        blank.lines.skip = blank.lines.skip, multi.line = FALSE, 
        comment.char = comment.char, allowEscapes = allowEscapes, 
        flush = flush, encoding = encoding, skipNul = skipNul)
    nlines <- length(data[[which.max(keep)]])
    if (cols != length(data)) {
        warning("cols = ", cols, " != length(data) = ", length(data), 
            domain = NA)
        cols <- length(data)
    }
    if (is.logical(as.is)) {
        as.is <- rep_len(as.is, cols)
    }
    else if (is.numeric(as.is)) {
        if (any(as.is < 1 | as.is > cols)) 
            stop("invalid numeric 'as.is' expression")
        i <- rep.int(FALSE, cols)
        i[as.is] <- TRUE
        as.is <- i
    }
    else if (is.character(as.is)) {
        i <- match(as.is, col.names, 0L)
        if (any(i <= 0L)) 
            warning("not all columns named in 'as.is' exist")
        i <- i[i > 0L]
        as.is <- rep.int(FALSE, cols)
        as.is[i] <- TRUE
    }
    else if (length(as.is) != cols) 
        stop(gettextf("'as.is' has the wrong length %d  != cols = %d", 
            length(as.is), cols), domain = NA)
    do <- keep & !known
    if (rlabp) 
        do[1L] <- FALSE
    for (i in (1L:cols)[do]) {
        data[[i]] <- if (is.na(colClasses[i])) 
            type.convert(data[[i]], as.is = as.is[i], dec = dec, 
                numerals = numerals, na.strings = character(0L))
        else if (colClasses[i] == "factor") 
            as.factor(data[[i]])
        else if (colClasses[i] == "Date") 
            as.Date(data[[i]])
        else if (colClasses[i] == "POSIXct") 
            as.POSIXct(data[[i]])
        else methods::as(data[[i]], colClasses[i])
    }
    compactRN <- TRUE
    if (missing(row.names)) {
        if (rlabp) {
            row.names <- data[[1L]]
            data <- data[-1L]
            keep <- keep[-1L]
            compactRN <- FALSE
        }
        else row.names <- .set_row_names(as.integer(nlines))
    }
    else if (is.null(row.names)) {
        row.names <- .set_row_names(as.integer(nlines))
    }
    else if (is.character(row.names)) {
        compactRN <- FALSE
        if (length(row.names) == 1L) {
            rowvar <- (1L:cols)[match(col.names, row.names, 0L) == 
                1L]
            row.names <- data[[rowvar]]
            data <- data[-rowvar]
            keep <- keep[-rowvar]
        }
    }
    else if (is.numeric(row.names) && length(row.names) == 1L) {
        compactRN <- FALSE
        rlabp <- row.names
        row.names <- data[[rlabp]]
        data <- data[-rlabp]
        keep <- keep[-rlabp]
    }
    else stop("invalid 'row.names' specification")
    data <- data[keep]
    if (is.object(row.names) || !(is.integer(row.names))) 
        row.names <- as.character(row.names)
    if (!compactRN) {
        if (length(row.names) != nlines) 
            stop("invalid 'row.names' length")
        if (anyDuplicated(row.names)) 
            stop("duplicate 'row.names' are not allowed")
        if (anyNA(row.names)) 
            stop("missing values in 'row.names' are not allowed")
    }
    class(data) <- "data.frame"
    attr(data, "row.names") <- row.names
    data
}
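
As a quick empirical sanity check (a minimal sketch, not a guarantee; it writes a throwaway CSV to a temporary file, consistent with the sequential reading described above):

tmp <- tempfile(fileext = ".csv")
writeLines(c("foo,bar", "d,987", "a,999", "b,654", "a,321"), tmp)

df <- read.csv(tmp, stringsAsFactors = FALSE)
stopifnot(identical(df$foo, c("d", "a", "b", "a")))      # same order as the file
stopifnot(identical(df$bar, c(987L, 999L, 654L, 321L)))  # values line up row by row
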
Kamil

Here is some basic code you can use to double-check the results coming from read.csv and subset:

Compare read.csv with readLines

Below is code that compares the result of read.csv with that of readLines (a function that reads a file line by line):

  library("readr" )
  library("rlist")
  file1<-file.choose() #Select your csv file1
  file2<-file.choose() #Select your csv file2

  #readLines
  input_list<-strsplit(readLines(file1),",")
  db_readLines<-data.frame(list.rbind(input_list[2:length(input_list)]))
  names(db_readLines)<-input_list[[1]]

  #readd.csv
  db_readcsv<-read.csv(file1,header = T,sep = ",")

  #Comparison
  if ((sum(db_readcsv==db_readLines)/(nrow(db_readcsv)*ncol(db_readcsv)))==1)
  {
    cat("Same data.frame")
  } else
  {
    cat("Data.frames are differents")
  }

You can run this on your CSV file to verify that read.csv preserves line order exactly as readLines does.

Compare subset with rbind + basic filtering

For the second part of the question, another easy test:

data1<-read.csv(file1, header=T,sep=",")
  data2<-read.csv(file2, header=T,sep=",")
  all_data<-rbind(data1, data2)
  filtered1<-subset(all_data, foo=="a")

  filtered2<-rbind(data1[data1$foo=="a",],data2[data2$foo=="a",])

  #Comparison
  if ((sum(filtered1==filtered2)/(nrow(filtered2)*ncol(filtered2)))==1)
  {
    cat("Same data.frame")
  } else
  {
    cat("Data.frames are differents")
  }

You can include these kinds of tests in your code, but obviously this is inefficient and redundant.

Terru_theTerror

It is safe to assume that all of those functions (read.csv, rbind, and subset) will preserve the order of your data as it appears in the original CSV.

Personally, I prefer using dplyr::filter over base::subset. As explained in this answer, the two work almost identically. The main difference is that subset comes with a warning in ?subset: "This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences." filter is designed to work robustly with the rest of dplyr and the tidyverse, both interactively and programmatically, and has a separate standard-evaluation version, filter_, for when that is necessary. So perhaps filter is a safer bet, especially if you're already using the dplyr framework. The only disadvantage to filter that I've encountered is that it does not keep rownames, while subset does (see the sketch below).
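
A minimal sketch of the rowname difference (assuming dplyr is installed; the data frame here is purely illustrative):

library(dplyr)

df <- data.frame(foo = c("a", "b", "a"), bar = 1:3, stringsAsFactors = FALSE)

rownames(subset(df, foo == "a"))  # "1" "3" -- original rownames kept
rownames(filter(df, foo == "a"))  # "1" "2" -- rownames reset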

Either way, I really don't think you need to worry about rows being reshuffled. In my experience, all of these functions have always produced R objects in the same order as the original data. If you want to be ultra-careful, it wouldn't hurt to go with @user127649's suggestion and add a unique ID column as a backup, as sketched below. I'm always in favor of lazier options, but it might be worth the peace of mind!
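
If you do go that route, here is a minimal sketch (the id column name is just an illustration, reusing data1 and data2 from the question):

# Tag each row with a running id before binding, so the original
# order can always be recovered after any later manipulation
data1$id <- seq_len(nrow(data1))
data2$id <- nrow(data1) + seq_len(nrow(data2))
all_data <- rbind(data1, data2)

filtered <- subset(all_data, foo == "a")
filtered <- filtered[order(filtered$id), ]  # restore original order if needed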

Joy