Using PERL regex broke merge of two data.tables

Question

I can't reproduce this with sample data, so please bear with me for a minute.

I'm trying to merge two data.tables, say DT1 and DT2. DT1 is read with fread from a .csv file (it should be UTF-8; I specified that when I wrote it myself from LibreOffice Calc on Linux), and DT2 was read from an .xls file with readxl::read_excel. I'm not sure of the encoding of the latter, but you can get the file here.

I tried matching on a regex'd version of one of the columns. The regex I used to clean the column in both tables was gsub("[[:punct:]]|\\(.*\\)", "", id). The tables merged fine; however, to improve the matching, I realized I should also exclude leading, trailing, and duplicate whitespace (i.e., to implement this approach to white space cleaning), using the updated regex:

gsub("(?<=[\\s])\\s*|^\\s+|\\s+$|[[:punct:]]|\\(.*\\)", "", name, perl = TRUE)

However, doing so introduces a warning, and the quality of the match deteriorates:

Warning message: In bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch, : A known encoding (latin1 or UTF-8) was detected in a join column. data.table compares the bytes currently, so doesn't support mixed encodings well; i.e., using both latin1 and UTF-8, or if any unknown encodings are non-ascii and some of those are marked known and others not. But if either latin1 or UTF-8 is used exclusively, and all unknown encodings are ascii, then the result should be ok. In future we will check for you and avoid this warning if everything is ok. The tricky part is doing this without impacting performance for ascii-only cases.

It appears that simply activating perl caused something to happen to the encodings and screw with my merge. I poked around in ?regex a bit, and it seems PERL treats separate encodings differently, but checking Encoding was no help -- DT1[ , Encoding(id)] and DT2[ , Encoding(id)] are both (all) unknown.
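For illustration, here is a minimal sketch of the kind of byte-level mismatch the warning describes. The key values are hypothetical (not taken from DT1 or DT2), and the `to_utf8` helper is an assumed name introduced only for this example. It shows the same text stored under two marked encodings, and one possible way to bring both sides to UTF-8 before using them as join keys; whether this is what happens with the real data is not confirmed.

```r
# Hypothetical key values; the real id columns are not reproduced here.
utf8_key   <- "caf\u00e9"                         # marked as UTF-8
latin1_key <- iconv(utf8_key, "UTF-8", "latin1")  # same text, marked as latin1

# Inspect the declared encodings of both strings.
Encoding(c(utf8_key, latin1_key))

# Same text, but the underlying bytes differ, which matters for a
# join that compares bytes.
identical(charToRaw(utf8_key), charToRaw(latin1_key))

# Assumed helper: convert a key column to UTF-8 before merging.
to_utf8 <- function(x) enc2utf8(as.character(x))

# After normalising, both keys have the same byte representation.
identical(charToRaw(to_utf8(utf8_key)), charToRaw(to_utf8(latin1_key)))
```

With real tables, the same helper could be applied to the cleaned id column on both sides before the merge; this is only a sketch, not a confirmed fix.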

Why might changing from POSIX to PERL regex have screwed up my merge attempt?

I think you can just use a TRE regex like this: `gsub("(\\s)\\s*|^\\s+|\\s+$|[[:punct:]]|\\(.*\\)", "\\1", name)`. However, since `(` and `)` are punctuation, perhaps you wanted `gsub("(\\s)\\s*|^\\s+|\\s+$|\\([^()]*\\)|[[:punct:]]", "\\1", name)`? — Wiktor Stribiżew, Dec 24 '15 at 22:05
@stribizhev that's similar to what I ended up doing. I'm more curious why using perl might have produced mismatched columns. — MichaelChirico, Dec 24 '15 at 22:23
