Using regex to split a column

Question

The regex I am using is \d+-\d+, but I'm not quite sure about how to separate the Roman numbers and how to create a new column with them.

I have this dataset:

Date_Title                        Date                       Copies
05-21 I. Don Quixote              1605                       252
21-20 IV. Macbeth                 1629                       987
10-12 ML. To Kill a Mockingbird   1960                       478
12 V. Invisible Man               1897                       136

Basically, I would like to split the "Date Title", so, when I print a row, I would get this:

('05-21 I', 'I', 'Don Quixote', 1605, 252)

Or

('10-12 ML', 'ML', 'To Kill a Mockingbird',1960, 478)

In the first place, the numbers and the roman numeral, in the second; only the Roman numeral, in the third the name, and the fourth and fifth would be the same as the dataset.

What have you tried so far? Where are you stuck? – Nico Haase Mar 29 '21 at 11:41 — Nico Haase, Mar 29 '21 at 11:41

score 2 · Accepted Answer · answered Jan 19 '21 at 00:36

You can use

df = pd.DataFrame({'Date_Title':['05-21 I. Don Quixote','21-20 IV. Macbeth','10-12 ML. To Kill a Mockingbird','12 V. Invisible Man'], 'Date':[1605,1629,1960,1897], 'Copies':[252,987,478,136]})
rx = r'^(\d+(?:-\d+)?\s*(M{0,4}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})))\.\s*(.*)'
df[['NumRoman','Roman','Name']] = df.pop('Date_Title').str.extract(rx)
df = df[['NumRoman','Roman','Name', 'Date', 'Copies']]
>>> df
   NumRoman Roman                   Name  Date  Copies
0   05-21 I     I            Don Quixote  1605     252
1  21-20 IV    IV                Macbeth  1629     987
2  10-12 ML    ML  To Kill a Mockingbird  1960     478
3      12 V     V          Invisible Man  1897     136

See the regex demo. Details:

^ - start of string
(\d+(?:-\d+)?\s*(M{0,4}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3}))) - Group 1 ("NumRoman"):
- \d+(?:-\d+)? - one or more digits followed with an optional sequence of a - and one or more digits
- \s* - zero or more whitespaces
- (M{0,4}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})) - Group 2 ("Roman"): see How do you match only valid roman numerals with a regular expression? for explanation
\. - a dot
\s* - zero or more whitespaces
(.*) - Group 3 ("Name"): any zero or more chars other than line break chars, as many as possible

Note df.pop('Date_Title') removes the Date_Title column and yields it as input for the extract method. df = df[['NumRoman','Roman','Name', 'Date', 'Copies']] is necessary if you need to keep the original column order.

score 1 · Answer 2 · answered Jan 18 '21 at 23:56

I am pretty sure there might be a more optimal solution, but this is would be a fast way of solving it:

df['Date_Title'] = df['Date_Title'].apply(lambda x: (x.split()[0],x.split()[1],' '.join(x.split()[2:])

Or:

df['Date_Title'] = (df['Date_Title'].str.split().str[0],
                    df['Date_Title'].str.split().str[1],
                    ' '.join(df['Date_Title'].str.split().str[2:])

score 1 · Answer 3 · answered Jan 18 '21 at 23:59

1

Focusing on the string split:

string = "21-20 IV. Macbeth"
i = string.index(".")  # Finds the first point
date, roman = string[:i].split() # 21-20, IV
title = string[i+2:]  # Macbeth

answered Jan 18 '21 at 23:59

frab

1,162
1
4
14

score 1 · Answer 4 · answered Jan 19 '21 at 00:11

1

df=df.assign(x=df['Date_Title'].str.split('\.').str[0],y=df['Date_Title'].str.extract('(\w+(?=\.))'),z=df['Date_Title'].str.split('\.').str[1:].str.join(','))

answered Jan 19 '21 at 00:11

wwnde

26,119
6
18
32

Using regex to split a column

4 Answers4