
I have a list object whose elements look like this:

> bk$x[[1]]
[1] "('bk0000003', 'spbk0002530', 'Certain', 'French editions', 'Abrégé de l''Histoire générale des voyages, contenant ce qu''il y a de plus remarquable, de plus utile et de mieux avéré dans les pays où les voyageurs ont pénétré; les moeurs des habitans, la religion, les usages, arts et sciences, commerce, manufactures... Par M. de La Harpe', 'Abrégé de l''histoire des voyages; abregé de l''histoire generale des Voyages; Abrégé des voyages', NULL, NULL, 'French', 'Hôtel de Thou', NULL, 'Paris', 'Paris', '1780-1786', NULL, NULL, NULL, 23, NULL, '8', '2220', NULL, 'Attribution - only located extant edition at the time it appeared in STN accounts.'),"

I want to extract every value enclosed in single quotes, where the values are separated by commas. The problem is that some values themselves contain commas (and escaped quotation marks) inside the single quotes. I'm pretty new to regex syntax in R, and my best effort has been some variant of strsplit(bk$x[[1]], ","), which also splits on the commas inside the quoted values.

I have found similar posts (see, e.g., here, here, and here) on Stack Overflow, but none of them quite does what I want.

My object (bk) contains more than 4,300 lists, so I would love to automate the process. I'd appreciate any suggestions you may have.

Emma
libretone

2 Answers


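One option is strsplit from base R: split at every single quote that is immediately followed by a comma, then use gsub to strip everything up to the opening quote of each piece, along with the trailing '), on the last one.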

gsub("^[^']*'|'\\),?$", "", strsplit(str1,  "'(?=,)", perl = TRUE)[[1]])

data

str1 <- "('bk0000003', 'spbk0002530', 'Certain', 'French editions', 'Abrégé de l''Histoire générale des voyages, contenant ce qu''il y a de plus remarquable, de plus utile et de mieux avéré dans les pays où les voyageurs ont pénétré; les moeurs des habitans, la religion, les usages, arts et sciences, commerce, manufactures... Par M. de La Harpe', 'Abrégé de l''histoire des voyages; abregé de l''histoire generale des Voyages; Abrégé des voyages', NULL, NULL, 'French', 'Hôtel de Thou', NULL, 'Paris', 'Paris', '1780-1786', NULL, NULL, NULL, 23, NULL, '8', '2220', NULL, 'Attribution - only located extant edition at the time it appeared in STN accounts.'),"
akrun

Here is a base R option using the following regex pattern:

'.*?'(?:,|$)

This will match all single-quoted content, with the end of each entry marked either by a closing single quote immediately followed by a comma, or by a single quote followed by the end of the input. This logic should get around the issue of both single quotes and commas being allowed inside each entry. Unquoted values such as NULL and 23 are not matched, so they do not appear in the output.

input <- "('bk0000003', 'spbk0002530', 'Certain', 'French editions', 'Abrégé de l''Histoire générale des voyages, contenant ce qu''il y a de plus remarquable, de plus utile et de mieux avéré dans les pays où les voyageurs ont pénétré; les moeurs des habitans, la religion, les usages, arts et sciences, commerce, manufactures... Par M. de La Harpe', 'Abrégé de l''histoire des voyages; abregé de l''histoire generale des Voyages; Abrégé des voyages', NULL, NULL, 'French', 'Hôtel de Thou', NULL, 'Paris', 'Paris', '1780-1786', NULL, NULL, NULL, 23, NULL, '8', '2220', NULL, 'Attribution - only located extant edition at the time it appeared in STN accounts.'"
output <- regmatches(input, gregexpr("'.*?'(?:,|$)", input, perl = TRUE))[[1]]
output <- sub("'(.*)',?$", "\\1", output)
output

 [1] "bk0000003"
 [2] "spbk0002530"
 [3] "Certain"
 [4] "French editions"
 [5] "Abrégé de l''Histoire générale des voyages, contenant ce qu''il y a de plus remarquable, de plus utile et de mieux avéré dans les pays où les voyageurs ont pénétré; les moeurs des habitans, la religion, les usages, arts et sciences, commerce, manufactures... Par M. de La Harpe"
 [6] "Abrégé de l''histoire des voyages; abregé de l''histoire generale des Voyages; Abrégé des voyages"
 [7] "French"
 [8] "Hôtel de Thou"
 [9] "Paris"
[10] "Paris"
[11] "1780-1786"
[12] "8"
[13] "2220"
[14] "Attribution - only located extant edition at the time it appeared in STN accounts."
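
To run this over every element of bk, the two steps can be wrapped in a function and applied with lapply. What follows is a minimal sketch rather than a solution tested on the full dataset. The helper name extract_quoted and the toy rows are hypothetical. It assumes each element is a single string wrapped in ( ... ), as in the question (the input above was trimmed so that it ends at the closing quote, which the pattern relies on), and that '' is the SQL escape for a literal single quote.

```r
# Hypothetical helper: pull the single-quoted fields out of one row string.
extract_quoted <- function(row) {
  # Drop the wrapping "(" and the trailing ")," so the last field ends the input.
  row <- sub("\\),?\\s*$", "", row)
  row <- sub("^\\(", "", row)
  # Same pattern as above: a quoted run ending in "'," or "'" at end of input.
  hits <- regmatches(row, gregexpr("'.*?'(?:,|$)", row, perl = TRUE))[[1]]
  # Strip the surrounding quotes and the optional trailing comma.
  fields <- sub("^'(.*)',?$", "\\1", hits)
  # Assumption: '' is an escaped single quote, so collapse it to '.
  gsub("''", "'", fields, fixed = TRUE)
}

# Toy stand-in for bk$x; the real object is assumed to be a list of strings.
rows <- list(
  "('a', 'b''c, d', NULL, 23, 'e'),",
  "('x', NULL, 'y, z'),"
)
result <- lapply(rows, extract_quoted)
```

With the real data the call would presumably be lapply(bk$x, extract_quoted). As with the pattern itself, unquoted values (NULL, numbers) are skipped, so rows can yield different numbers of fields.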
Tim Biegeleisen
  • Thanks a lot, it worked! This is beyond my initial question, but do you know if there is a way to make NULL NA (or leave NULL as is)? The original dataset is a SQL file, and after turning it into a CSV, my attempt to coerce NULL into NA by read.csv ("....csv", na.strings=c("NULL","NA")) did not work. – libretone Jul 29 '19 at 07:11
  • Well you could do `gsub("NULL,", "'NULL',", input)` to get a `NULL` string. Then use `ifelse(output=="NULL", NA, output)` on the final output. – Tim Biegeleisen Jul 29 '19 at 07:17
  • Thanks so much, Tim! I knew of gsub() but it didn't come to mind. You made my day! – libretone Jul 29 '19 at 07:53
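
The suggestion in the comments can be put together as follows. This is a minimal sketch on a shortened, made-up input, not the asker's actual data. It assumes every bare NULL is followed by a comma (a NULL in the last position would not be caught) and that no genuine field contains the text NULL, since such a field would also become NA.

```r
# Shortened, hypothetical input with two bare NULL tokens.
input <- "'bk0000003', NULL, 'French', NULL, 'Paris'"

# Wrap each bare NULL in single quotes so the pattern picks it up.
quoted <- gsub("NULL,", "'NULL',", input)

# Extract the quoted fields exactly as in the answer above.
output <- regmatches(quoted, gregexpr("'.*?'(?:,|$)", quoted, perl = TRUE))[[1]]
output <- sub("'(.*)',?$", "\\1", output)

# Turn the placeholder string "NULL" into a real missing value.
output <- ifelse(output == "NULL", NA, output)
output
```

Quoting the NULL tokens first also keeps the fields in their original positions, which matters if the values are later bound into columns.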