1

I want to read a .dbf file in R that contains "deleted" rows. However, I don´t know how to avoid reading all records. I am using the package "foreign" and the function read.dbf.

According to the .dbf file description, each record begins with a 1-byte "deletion" flag. The byte's value is a space (0x20), if the record is active, or an asterisk (0x2A), if the record is deleted.

How can I extract this information using R? E.g. for a small sample of the iris data set saved as a .dbf file:

library(foreign)
dbf.file <- 'iris.dbf'
write.dbf(iris[1:5, ], file=dbf.file)
ana_gg
  • 370
  • 1
  • 9
  • 1
    DBF deleted records start with a one byte character, the asterisk, ASCII 2A (hex) or 42 (decimal). And if the record is not deleted, that byte should be a space, 20 hex or 32 decimal. I am not sure how this can be of help, this century I haven't worked with DBF files yet. – Rui Barradas Aug 25 '23 at 14:36
  • Thanks. I read that information in another question here in stack overflow, but I don´t know how it helps in practice. – ana_gg Aug 25 '23 at 14:54
  • 2
    It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Aug 25 '23 at 15:03

1 Answers1

1

We can use the readBin() function to read the .dbf file as binary data.

# read binary data
bin <- readBin(dbf.file, what='raw', n=file.size(dbf.file))

Then, based on the .dbf format description, we can read information necessary to navigate ourselves to the first byte of each record. I use a custom function to convert the appropriate bytes from the .dbf header into an unsigned integer.

bin_to_int <- function(bin) {
  if (.Platform$endian == 'big') {
    bin <- rev(bin)
  }
  sum(2^(which(as.logical(rawToBits(bin))) - 1))
}

# number of records
n <- bin_to_int(bin[5:8])
# numer of bytes in the header
header.len <- bin_to_int(bin[9:10])
# numer of bytes in the record
record.len <- bin_to_int(bin[11:12])

With these, it is possible to compute what are the records' first bytes and see if they mark the record as deleted or not.

# first bytes of records
record.starts <- header.len + 1 + (seq_len(n)-1) * record.len
is.deleted <- bin[record.starts] == 0x2A
is.deleted
# [1] FALSE FALSE FALSE FALSE FALSE

Indeed, none of the records were marked as deleted so we can at least check if the bytes hold the expected value of 0x20:

bin[record.starts]
# [1] 20 20 20 20 20

On a side note, from documentation it is not clear how read.dbf() treats the deleted records so chances are it ignores them and you won't have to deal with this issue at all. It would be interesting to know this so please let us know in the comments.

Robert Hacken
  • 3,878
  • 1
  • 13
  • 15
  • 1
    Thank you, Robert,! Indeed, it worked! The final part should be to read the file as final.file=read.dbf(dbf.file), and then final.file = final.file[!is.deleted,]. The read.dbf() does not recognize the deleted, it reads all records and does not have a parameter to avoid reading all. – ana_gg Aug 28 '23 at 14:35