How to split strings in R and place strings into specific columns based type?

Question

I have a programming challenge that I can't overcome, please help! I have historical time series rates and dates stored as a string in one cell for each security (10K+ security). How can I split the string and store the information in their respective columns? I usually work in R but happy to try it in Python if easier!

Here is what my data looks like, "Security" and "Series" are my column names:

+----+---+---+---+---+---+---+---+---+---+---+---+---+---+--+---+---+----+
| Security |              Series                                         | 
+----+---+---+---+---+---+---+---+---+---+---+---+---+---+--+---+---+----+
| 567895B | "3/15/2019 2.51 3/17/2019 2.30 4/08/2019 2.41"               |
| 165456C | "1/05/2018 2.45 1/28/2018 2.46"                              |
| 123456A | "1/05/2016 2.45 2/05/2016 2.46 3/05/2016 2.45 5/05/2016 2.47"|
+----+---+---+---+---+---+---+---+---+---+---+---+---+---+--+---+---+----+

Here is what I want it to look like:

+----+-------+-------+---------+------+
| Security   | date          |  rate  | 
+----+-------+-------+---------+------+
|  567895B   |   3/15/2019   |  2.51  |
|  567895B   |   3/17/2019   |  2.30  |
|  567895B   |   4/08/2019   |  2.41  |
|  165456C   |   1/05/2018   |  2.45  |      
|  165456C   |   1/28/2018   |  2.46  |
|  123456A   |   1/05/2016   |  2.45  |
|  123456A   |   2/05/2016   |  2.46  |   
|  123456A   |   3/05/2016   |  2.45  |   
|  123456A   |   5/05/2016   |  2.47  |   
+----+-------+-------+---------+------+

score 2 · Accepted Answer · answered May 24 '19 at 00:13

In R, you can use tidyverse to do this. We first separate every values in Series into separate columns. To do that we calculate number of whitespaces in Series and select the maximum value so we know how many columns will be added. We then use gather to convert it into long form, create a group identifier for Date and rate field and cast it to wide format using spread.

library(tidyverse)
n <- max(str_count(df$Series, "\\s+")) + 1

df %>% 
  separate(Series, into = paste0("col", 1:n), sep = "\\s+", fill = "right") %>%
  gather(key, value, -Security, na.rm = TRUE) %>%
  mutate(key = ceiling(as.integer(sub("col", "", key))/2)) %>%
  group_by(Security, key) %>%
  mutate(row = row_number()) %>%
  spread(row, value) %>%
  ungroup() %>%
  select(-key) %>%
  rename_at(2:3, ~c("Date", "rate"))

# A tibble: 9 x 3
#  Security Date      rate
#  <chr>    <chr>     <chr>
#1 123456A  1/05/2016 2.45 
#2 123456A  2/05/2016 2.46 
#3 123456A  3/05/2016 2.45 
#4 123456A  5/05/2016 2.47 
#5 165456C  1/05/2018 2.45 
#6 165456C  1/28/2018 2.46 
#7 567895B  3/15/2019 2.51 
#8 567895B  3/17/2019 2.30 
#9 567895B  4/08/2019 2.41

data

df <- structure(list(Security = c("567895B", "165456C", "123456A"), 
Series = c("3/15/2019 2.51 3/17/2019 2.30 4/08/2019 2.41", 
"1/05/2018 2.45 1/28/2018 2.46", "1/05/2016 2.45 2/05/2016 2.46 3/05/2016 
2.45 5/05/2016 2.47"
)), row.names = c(NA, -3L), class = "data.frame")

Onyambu · Answer 2 · 2019-05-24T02:40:31.877

 library(tidyverse)
 df%>%
   mutate(Series=strsplit(Series," (?=\\d+/)",perl = T))%>%
   unnest()%>%
   separate(Series,c('Date','Rate'),' ',convert = T)

  Security      Date Rate
1  567895B 3/15/2019 2.51
2  567895B 3/17/2019 2.30
3  567895B 4/08/2019 2.41
4  165456C 1/05/2018 2.45
5  165456C 1/28/2018 2.46
6  123456A 1/05/2016 2.45
7  123456A 2/05/2016 2.46
8  123456A 3/05/2016 2.45
9  123456A 5/05/2016 2.47

To use base R, you could do:

m = gregexpr("(^|\\s)(?=\\d+/)",df$Series,perl = T)
read.table(text=`regmatches<-`(df$Series, m ,val=paste("\n",df$Security,' ')))
       V1        V2   V3
1 567895B 3/15/2019 2.51
2 567895B 3/17/2019 2.30
3 567895B 4/08/2019 2.41
4 165456C 1/05/2018 2.45
5 165456C 1/28/2018 2.46
6 123456A 1/05/2016 2.45
7 123456A 2/05/2016 2.46
8 123456A 3/05/2016 2.45
9 123456A 5/05/2016 2.47

G. Grothendieck · Answer 3 · 2019-05-24T12:18:58.043

Using the data frame secs shown reproducibly in the Note at the end, insert a newline and the Security name before each date in the Series string. The replacement character vector is defined as repl. It is inserted into Series returning a character vector ch of the revised Series. Then read ch in using read.table and finally change the date to Date class. No packages are used.

repl <- sprintf("\n%s \\1", secs$Security)
ch <- mapply(gsub, "(\\d+/\\d+/\\d+)", repl, secs$Series)
DF <- read.table(text = ch, col.names = c("Security", "Date", "Value"), as.is = TRUE)
DF$Date <- as.Date(DF$Date, format = "%m/%d/%Y")

giving:

> DF
  Security       Date Value
1  567895B 2019-03-15  2.51
2  567895B 2019-03-17  2.30
3  567895B 2019-04-08  2.41
4  165456C 2018-01-05  2.45
5  165456C 2018-01-28  2.46
6  123456A 2016-01-05  2.45
7  123456A 2016-02-05  2.46
8  123456A 2016-03-05  2.45
9  123456A 2016-05-05  2.47

The same code could alternately be written as the following pipeline:

library(dplyr)

secs %>%
  rowwise() %>%
  mutate(ch = gsub("(\\d+/\\d+/\\d+)", sprintf("\n%s \\1", Security), Series)) %>%
  ungroup %>%
  { read.table(text = .$ch, col.names = c("Security", "Date", "Value"), as.is = TRUE) } %>%
  mutate(Date = as.Date(Date, format = "%m/%d/%Y"))

Note

Lines <- '
 Security |  Series
 567895B | "3/15/2019 2.51 3/17/2019 2.30 4/08/2019 2.41"               
 165456C | "1/05/2018 2.45 1/28/2018 2.46"                              
 123456A | "1/05/2016 2.45 2/05/2016 2.46 3/05/2016 2.45 5/05/2016 2.47" '
secs <- read.table(text = Lines, header = TRUE, sep = "|", na.strings = "+", 
  as.is = TRUE, strip.white = TRUE)

score 0 · Answer 4 · answered May 24 '19 at 00:50

Let see one way to do with the Python module pandas. The steps:

Convert the column Series to a list.
- Method used: pandas.str.split. (doc)
Define the columns Security as an index of the dataframe (usefull for the indexing in the next step).
- Methods used: set_index (doc)
Convert the list in each rows as new lines. Inspired from this post.
- Methods used:
  - stack: stack elements in the list of the rows
  - reset_index: doc
  - to_frame: doc
Arange the new dataframe

Here the code of the different steps:

# Import module
import pandas as pd

# Define dataframe
df = pd.DataFrame([["567895B" , "3/15/2019 2.51 3/17/2019 2.30 4/08/2019 2.41" ],
                    ["165456C" , "1/05/2018 2.45 1/28/2018 2.46" ],
                    ["123456A" , "1/05/2016 2.45 2/05/2016 2.46 3/05/2016 2.45 5/05/2016 2.47"]], 
                    columns = ["Security", "Series"])
print(df)
#   Security                                             Series
# 0  567895B       3/15/2019 2.51 3/17/2019 2.30 4/08/2019 2.41
# 1  165456C                      1/05/2018 2.45 1/28/2018 2.46
# 2  123456A  1/05/2016 2.45 2/05/2016 2.46 3/05/2016 2.45 5...

# Define indexation
df = df.set_index("Security")
print(df)
#                                                      Series
# Security
# 567895B        3/15/2019 2.51 3/17/2019 2.30 4/08/2019 2.41
# 165456C                       1/05/2018 2.45 1/28/2018 2.46
# 123456A   1/05/2016 2.45 2/05/2016 2.46 3/05/2016 2.45 5...

# Transform long string to list
df["Series"] = df["Series"].str.split(' ', expand=False)
print(df)
#                                                      Series
# Security
# 567895B   [3/15/2019, 2.51, 3/17/2019, 2.30, 4/08/2019, ...
# 165456C                  [1/05/2018, 2.45, 1/28/2018, 2.46]
# 123456A   [1/05/2016, 2.45, 2/05/2016, 2.46, 3/05/2016, ...

# Transform list in new rows
df = df.apply(lambda x: pd.Series(x['Series']),axis=1).stack().reset_index(level=1, drop=True).to_frame()
df.columns = ["Series"]
print(df)
#              Series
# Security
# 567895B   3/15/2019
# 567895B        2.51
# 567895B   3/17/2019
# 567895B        2.30
# 567895B   4/08/2019
# 567895B        2.41
# 165456C   1/05/2018
# 165456C        2.45
# 165456C   1/28/2018
# 165456C        2.46
# 123456A   1/05/2016
# 123456A        2.45
# 123456A   2/05/2016
# 123456A        2.46
# 123456A   3/05/2016
# 123456A        2.45
# 123456A   5/05/2016
# 123456A        2.47

# Rebuild dataset
res = pd.DataFrame({"Date": df["Series"][::2], 
                    "Rate": df["Series"][1::2]})
print(res)
#                Date  Rate
# Security
# 567895B   3/15/2019  2.51
# 567895B   3/17/2019  2.30
# 567895B   4/08/2019  2.41
# 165456C   1/05/2018  2.45
# 165456C   1/28/2018  2.46
# 123456A   1/05/2016  2.45
# 123456A   2/05/2016  2.46
# 123456A   3/05/2016  2.45
# 123456A   5/05/2016  2.47

score 0 · Answer 5 · answered May 24 '19 at 01:53

And for an old-fashioned method in base R (without argument checks):

# Starting with values in `df`
  txt <- strsplit(df$Series, "\\s+")
  Security <- rep(df$Security, lengths(txt)/2) # note 'lengths' with 's'
  txt <- unlist(txt)
  date <- txt[seq(1, length(txt), 2)] # here just 'length'
  rate <- txt[seq(2, length(txt), 2)]
  ans <- data.frame(Security, date, rate)

How to split strings in R and place strings into specific columns based type?

5 Answers5

Note