How to indicate not only the delimiter but also its position (like you can do in SQL) for the separate function in R?

Question

I want to know how to split a column on a delimiter while also constraining where that delimiter may appear. I need to separate the title of the film from the year, and the common delimiter is "(". However, some movies have brackets in their titles as well, so I want to specify that the bracket must be followed by a number, without the number itself being used as part of the separator.

Here is the code:

imdb_ratings <-  imdb_ratings %>% separate(col = title, into = c("title", "year"), 
                                           sep = "\\(*[:digit:]")

The result is wrong: all the values in the year column are NA. I already know that my code tries to use both the bracket and the number as the separator (I guess you can have only one character), but I don't know how to indicate where the bracket should be. I tried something like "\\(?=[:digit:]", but that doesn't work either.

[UPDATE]

Here is my code now:

imdb_ratings <- imdb_ratings %>%   filter(Animation == 1 & !str_detect(title, "\\$")) %>% 
                                   separate(col = title, into = c("title", "year"),  
                                            sep = "\\((?=\\d)")

I wanted to filter out the rows that end with a backslash, because I know that they don't have a year. That's why I added !str_detect(title, "\\$"), but it doesn't work: after filtering, the result still contains the rows that have a backslash at the end: [![enter image description here][1]][1]

[UPDATE2] How can I use the separate function to get the year of the movie into the second column when the bracket is followed not by a year but by some text? In the screenshot you can see an example, "Aladdin (Video game 1993)". What should I do to get Aladdin in the first column and 1993 in the year column? One option might be to keep the bracketed "Video game" in the first column as well.

[![enter image description here][2]][2]

[UPDATE] The regex string was working all along, but now R suddenly gives an error for it.

The code was not changed:

imdb <- imdb %>%  extract(title, c("title", "year"), 
                          "^(.*?)(?:\\s*\\([^()]*?(\\d{4})[^()]*\\))?$") 

the error: Error in drop && length(x) == 1L : invalid 'x' type in 'x && y'

BTW, there is a way to split the columns using nth matches, or even some position in the string, too, but to suggest a solution like that, more details are necessary. — Wiktor Stribiżew, Apr 19 '20 at 18:01

Wiktor Stribiżew · Accepted Answer · 2020-04-19T18:24:14.397

1

If you plan to split a string at a ( char that is followed by a digit, you may use

\((?=\d)

See the regex demo. It matches a ( with \( and the positive lookahead (?=\d) requires the presence of a digit immediately to the right of the current location.
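A minimal sketch of that lookahead, using base R's strsplit with perl = TRUE so that it runs without extra packages. The sample titles are made up for illustration and are not taken from the asker's data.

```r
# Hypothetical sample titles (not from the original data set)
titles <- c("Toy Story (1995)", "Aladdin (Video game 1993)")

# Split at "(" only when a digit follows; the lookahead consumes nothing,
# so the digit stays in the second piece
parts <- strsplit(titles, "\\((?=\\d)", perl = TRUE)

parts[[1]]  # "Toy Story " "1995)"
parts[[2]]  # "Aladdin (Video game 1993)" (no split: "(" is followed by "V")
```

Note that the trailing space in the title and the closing ")" after the year are left in place, because only the "(" itself is treated as the separator.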

To check if the last char of a string is a backslash, you may use the \\$ pattern, written as the string literal "\\\\$" in R. See the regex demo.

In your case, you may use it as

library(dplyr)
library(tidyr)
library(stringr)
imdb_ratings <- imdb_ratings %>%
  filter(Animation == 1 & !str_detect(title, "\\\\$")) %>%
  separate(col = title, into = c("title", "year"), sep = "\\((?=\\d)")
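To see why four backslashes are needed, here is a small base R check on made-up titles. It uses grepl in place of str_detect, on the assumption that both treat this simple pattern the same way.

```r
# Hypothetical titles; in R source "\\" is one literal backslash character
titles <- c("Toy Story (1995)", "Broken title\\")

# Regex \\$ = an escaped backslash followed by end-of-string;
# as an R string literal each backslash is doubled again: "\\\\$"
ends_with_backslash <- grepl("\\\\$", titles)
ends_with_backslash  # FALSE TRUE

# "\\$" is the regex \$, i.e. a literal dollar sign, so it finds nothing here
matches_dollar <- grepl("\\$", titles)
matches_dollar  # FALSE FALSE

titles[!ends_with_backslash]  # "Toy Story (1995)"
```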

edited Apr 19 '20 at 18:24

answered Apr 19 '20 at 17:59

Wiktor Stribiżew


I have now updated my post and attached a screenshot of the table. You can see that I was unsuccessful at filtering out the rows that end with a backslash. Also, there are rows where the year does not come right after the bracket but after the type of movie (for example, video game). So maybe, instead of matching the symbol right after the bracket, the delimiter could be any symbol that comes right before a digit (as long as there is a bracket before the character string)? – use1883 Apr 19 '20 at 18:18
@use1883 You need `!str_detect(title, "\\\\$")` to filter out the entries with ``\`` at the end – Wiktor Stribiżew Apr 19 '20 at 18:20
@ WiktorStribiżew And how about the second part of the question? – use1883 Apr 19 '20 at 18:41
@use1883 Not sure, what do you mean? BTW, what is the expected output? Please provide sample output for a couple of lines. Do you mean you need `sep = "\\s*(?=\\(\\d)"`? Or `extract(title, c("title", "year"), "^(.*?)(?:\\s*\\(([^()]*)\\))?$")`? Or `extract(title, c("title", "year"), "^(.*?)(?:\\s*\\([^()]*?(\\d{4})[^()]*\\))?$")`? – Wiktor Stribiżew Apr 19 '20 at 18:43
@ WiktorStribiżew When I use sep = "\\s*(?=\\(\\d)" then the years disappear, what can I do, so that the \\d itself is not like a delimiter ( but perhaps the symbol just right before the number is the delimiter )? – use1883 Apr 19 '20 at 19:16
@use1883 I am afraid I can't help any more, your question has become too unclear. Provide example strings (dataframe?) and exact expected output. – Wiktor Stribiżew Apr 19 '20 at 19:25
@ WiktorStribiżew I updated my post and attached another screenshot where it's demonstrated that for some rows the title is as follows: "\string ( \string \year )" and I want to get the \year in the second column. – use1883 Apr 19 '20 at 19:56
@use1883 Then use `extract(title, c("title", "year"), "^(.*?)(?:\\s*\\([^()]*?(\\d{4})[^()]*\\))?$")` – Wiktor Stribiżew Apr 19 '20 at 20:05
@ WiktorStribiżew It was helpful, but I couldn't begin to understand the string of the symbols you wrote. :) – use1883 Apr 20 '20 at 10:32
@use1883 If you mean `^(.*?)(?:\\s*\\([^()]*?(\\d{4})[^()]*\\))?$`, you may see [it in action here with explanation on the right](https://regex101.com/r/DUHNNO/1). – Wiktor Stribiżew Apr 20 '20 at 10:34
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/212076/discussion-between-use1883-and-wiktor-stribizew). – use1883 Apr 20 '20 at 11:05
@ WiktorStribiżew Sorry to disturb you, but the line again doesn't run properly and gives an error: Error in drop && length(x) == 1L : invalid 'x' type in 'x && y'. Maybe you know what it means? – use1883 Apr 21 '20 at 13:20
@use1883 Use `tidyr::extract` instead of simply `extract` ([see this](https://stackoverflow.com/q/28963899/3832970)) – Wiktor Stribiżew Apr 21 '20 at 13:40
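For readers who find the extract pattern from the comments hard to read, here is a piece-by-piece sketch in base R. The sample titles are made up, and sub with perl = TRUE is used only as a stand-in for pulling out the two capture groups.

```r
# Hypothetical sample titles (not from the original data set)
titles <- c("Toy Story (1995)", "Aladdin (Video game 1993)")

pattern <- paste0(
  "^(.*?)",        # group 1: the title, matched lazily
  "(?:\\s*\\(",    # optional part: spaces, then a literal "("
  "[^()]*?",       #   any non-bracket text, e.g. "Video game "
  "(\\d{4})",      #   group 2: four digits, the year
  "[^()]*\\)",     #   any remaining non-bracket text, then ")"
  ")?$"            # the bracketed part may be missing; then end of string
)

# The pattern matches the whole string, so each sub() call keeps one group
title <- sub(pattern, "\\1", titles, perl = TRUE)
year  <- sub(pattern, "\\2", titles, perl = TRUE)

title  # "Toy Story" "Aladdin"
year   # "1995" "1993"
```

With tidyr, the same pattern string would be passed as `tidyr::extract(title, c("title", "year"), pattern)`, as in the comments above. Writing the `tidyr::` prefix presumably avoids picking up a function named extract from another attached package.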

akrun · Answer 2 · 2020-04-19T18:15:54.423

0

We can use a regex lookaround here

library(dplyr)
library(tidyr)
imdb_ratings %>%
     separate(col = title, into = c("title", "year"), 
              sep = "\\((?=[[:digit:]])")

If we need to filter out the rows that end with \, then add a filter

imdb_ratings  %>%
        filter(substring(title, nchar(title)) != '\\')

edited Apr 19 '20 at 18:15

answered Apr 19 '20 at 17:52

akrun

