Subsetting from serialized string in R

Question

I get a field from a SQL database in which various information is stored as a serialized object. I would like to extract the values from this string in R and save them as separate columns.

Here is a simplified example:

df <- data.frame(ID = c(1,2,3),
                 information = c("{s:4:\"name\";s:3:\"max\";s:3:\"age\";s:1:\"8\";}",
                                 "{s:4:\"name\";s:5:\"peter\";s:3:\"age\";s:2:\"10\";}",
                                 "{s:4:\"name\";s:4:\"susy\";s:3:\"age\";s:3:\"100\";}") 
                 )

df <- df %>% 
  mutate(
    name = substring(information, gregexpr('"', information)[[1]][3] +1, gregexpr('"', information)[[1]][4]-1)
  )

df

I tried to select the value between the third and fourth quotes. Unfortunately, this selection only works for the first value. For the other selections, the positions are obviously taken from the first value.

[1] "max" "pet" "sus"

Thank you for your ideas

I'd generally recommend using a proper parser rather than an ad-hoc and error-prone regular expression (for instance, your current approach is sensitive to the ordering of the keys inside the serialised string). That said, the issue is caused by the fact that you are always selecting the first hit returned by `gregexpr`. To fix the issue, you need to extract the appropriate value. This can be done using e.g. `vapply`: ``vapply(grepexpr('"', b), `[`, integer(1L), 3L) + 1L``, etc. — Konrad Rudolph, May 15 '23 at 07:12

Chris Ruehlemann · Answer 1 · 2023-05-15T07:33:45.800

1

You can use extract if you want to extract all the 'serialized' variables:

library(tidyr)
df %>% 
  extract(information,
          into = str_c("var", 1:4),
          regex = '.*"(\\w+).*"(\\w+).*"(\\w+).*"(\\w+).*"')
  ID var1  var2 var3 var4
1  1 name   max  age    8
2  2 name peter  age   10
3  3 name  susy  age  100

Alternatively, if you just want the nameand agecolumns, you can use str_extract and positive look-behind:

library(stringr)
df %>%
  mutate(name = str_extract(information, '(?<=name.{1,10}")\\w+'),
         age = str_extract(information, '(?<=age.{1,10}")\\d+'))
  ID                                  information  name age
1  1    {s:4:"name";s:3:"max";s:3:"age";s:1:"8";}   max   8
2  2 {s:4:"name";s:5:"peter";s:3:"age";s:2:"10";} peter  10
3  3 {s:4:"name";s:4:"susy";s:3:"age";s:3:"100";}  susy 100

Here the regex pattern for name specifies the following instructions:

(?<=name.{1,10}"): if you see the string namefollowed by at least 1 and at most 10 more characters of any kind and followed by " then ...
\\w+: ... extract the alphanumeric string thereafter

edited May 15 '23 at 07:33

answered May 15 '23 at 07:13

Chris Ruehlemann

20,321
4
12
34

thanks !!! I'm excited. I think I have to deal more intensively with regular expression. – saibot May 15 '23 at 07:24
Absolutely! Regex is exciting. Please consider upvoting and/or accepting the answer if it helps you solve your problem – Chris Ruehlemann May 15 '23 at 07:33
How do I get decimal numbers, e.g. if the age is 3.10 with (?<=age.{1,10}")\\d+') ? – saibot May 18 '23 at 21:26
Use this pattern: `'(?<=age.{1,10}")\\d+(\\.\\d+)?'`. And when that helps please consider upvoting/accepting answers: 4 ppl have invested time and expertise into helping you solve your problem – Chris Ruehlemann May 19 '23 at 06:23
many thanks for the help. Of course, I had already tried to vote for the post, but unfortunately it's not possible yet. You need at least 15 reputation to cast a vote. – saibot May 22 '23 at 18:45

margusl · Answer 2 · 2023-05-15T07:59:44.000

Looks like a serialized PHP object. Here's a slightly different regex approach, it still assumes consistent structure but allows us to verify consistency in key columns and does handle punctuation in names and ages (added offending record to input dataset):

library(dplyr)
library(tidyr)
library(stringr)

df <- data.frame(ID = c(1,2,3,4),
                 information = c("{s:4:\"name\";s:3:\"max\";s:3:\"age\";s:1:\"8\";}",
                                 "{s:4:\"name\";s:5:\"peter\";s:3:\"age\";s:2:\"10\";}",
                                 "{s:4:\"name\";s:4:\"susy\";s:3:\"age\";s:3:\"100\";}",
                                 "{s:5:\"valid\";s:2:\"no\";s:4:\"name\";s:4:\"susy\";s:3:\"age\";s:3:\"100\";}") 
)

df %>% 
  # match anything but double-quotes that are between ':"' and '";', 
  mutate(extr = str_extract_all(information, '(?<=:")[^"]+(?=";)')) %>% 
  unnest_wider(extr, names_sep = "_")
#> # A tibble: 4 × 8
#>      ID information                    extr_1 extr_2 extr_3 extr_4 extr_5 extr_6
#>   <dbl> <chr>                          <chr>  <chr>  <chr>  <chr>  <chr>  <chr> 
#> 1     1 "{s:4:\"name\";s:3:\"max\";s:… name   max    age    8      <NA>   <NA>  
#> 2     2 "{s:4:\"name\";s:5:\"peter\";… name   peter  age    10     <NA>   <NA>  
#> 3     3 "{s:4:\"name\";s:4:\"susy\";s… name   susy   age    100    <NA>   <NA>  
#> 4     4 "{s:5:\"valid\";s:2:\"no\";s:… valid  no     name   susy   age    100

^{Created on 2023-05-15 with reprex v2.0.2}

thanks a lot. exactly, its a serialized PHP Object in one datafield with about 30 information. — saibot, May 15 '23 at 08:34

score 1 · Accepted Answer · answered May 15 '23 at 07:48

Split on delimiter and read as delimited text, keep columns we need:

read.delim(text = gsub(";", ":", df$information, fixed = TRUE),
           sep = ":", header = FALSE)[, c(3, 6, 9, 12)]
#     V3    V6  V9 V12
# 1 name   max age   8
# 2 name peter age  10
# 3 name  susy age 100

Note: there is a pattern seq(3, 12, 3) to automate the column selection.

mmrabe · Answer 4 · 2023-05-15T13:10:29.503

If you still have access to the complete serialized strings, you could also use PHP to transform to a portable intermediate format like JSON, which R can parse. Then you won't have to take care of all the complex escaping rules in those strings. For example:



library(tidyverse)

unserialize_php <- function(information) {
  system2("php", c("-r", shQuote("$output = []; while(($line=fgets(STDIN))!==false) {$new_output = unserialize($line); $output[] = $new_output ? $new_output : (object)[];} echo(json_encode($output));")), stdout = TRUE, stderr = "", input = information) %>% fromJSON(simplifyDataFrame = TRUE)
}

df <- tibble(
  ID = 1:3,
  information = unserialize_php(
    c("a:2:{s:4:\"name\";s:3:\"max\";s:3:\"age\";s:1:\"8\";};",
      "a:2:{s:4:\"name\";s:5:\"peter\";s:3:\"age\";s:2:\"10\";};",
      "a:2:{s:4:\"name\";s:4:\"susy\";s:3:\"age\";s:3:\"100\";};")
  )
)

df

# # A tibble: 3 × 2
#      ID information$name $age 
#   <int> <chr>            <chr>
# 1     1 max              8    
# 2     2 peter            10   
# 3     3 susy             100

The critical piece here is the unserialize_php function. It forwards all values of the argument information (or, the information column of the data frame df, respectively) as separate lines to a temporary PHP instance. The script to be run in PHP is supplied as a command-line argument (-r). The script then takes the input line by line, unserializes each line, adds them to a temporary array($output) and finally outputs the array as JSON. PHP writes to stdout and R picks it up and returns it as the return value of system2 (because stdin = TRUE). fromJSON from the jsonlite package parses the JSON data and transforms it to a data frame. If the string representations passed to the function are malformed, you will see PHP errors in the R console and the respective lines will be empty / NAs.

Note: You need the complete serialized strings, as originally produced by PHP. I edited your original values to what those should look like. Also, this approach requires that R can call PHP via command line on your machine.

Subsetting from serialized string in R

4 Answers4