-3

The regex I am using is \d+-\d+, but I'm not quite sure about how to separate the Roman numbers and how to create a new column with them.

I have this dataset:

Date_Title                        Date                       Copies
05-21 I. Don Quixote              1605                       252
21-20 IV. Macbeth                 1629                       987
10-12 ML. To Kill a Mockingbird   1960                       478
12 V. Invisible Man               1897                       136

Basically, I would like to split the "Date Title", so, when I print a row, I would get this:

('05-21 I', 'I', 'Don Quixote', 1605, 252)

Or

('10-12 ML', 'ML', 'To Kill a Mockingbird',1960, 478)

In the first place, the numbers and the roman numeral, in the second; only the Roman numeral, in the third the name, and the fourth and fifth would be the same as the dataset.

Mark Rotteveel
  • 100,966
  • 191
  • 140
  • 197
Jusep
  • 187
  • 8

4 Answers4

2

You can use

df = pd.DataFrame({'Date_Title':['05-21 I. Don Quixote','21-20 IV. Macbeth','10-12 ML. To Kill a Mockingbird','12 V. Invisible Man'], 'Date':[1605,1629,1960,1897], 'Copies':[252,987,478,136]})
rx = r'^(\d+(?:-\d+)?\s*(M{0,4}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})))\.\s*(.*)'
df[['NumRoman','Roman','Name']] = df.pop('Date_Title').str.extract(rx)
df = df[['NumRoman','Roman','Name', 'Date', 'Copies']]
>>> df
   NumRoman Roman                   Name  Date  Copies
0   05-21 I     I            Don Quixote  1605     252
1  21-20 IV    IV                Macbeth  1629     987
2  10-12 ML    ML  To Kill a Mockingbird  1960     478
3      12 V     V          Invisible Man  1897     136

See the regex demo. Details:

  • ^ - start of string
  • (\d+(?:-\d+)?\s*(M{0,4}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3}))) - Group 1 ("NumRoman"):
  • \. - a dot
  • \s* - zero or more whitespaces
  • (.*) - Group 3 ("Name"): any zero or more chars other than line break chars, as many as possible

Note df.pop('Date_Title') removes the Date_Title column and yields it as input for the extract method. df = df[['NumRoman','Roman','Name', 'Date', 'Copies']] is necessary if you need to keep the original column order.

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
1

I am pretty sure there might be a more optimal solution, but this is would be a fast way of solving it:

df['Date_Title'] = df['Date_Title'].apply(lambda x: (x.split()[0],x.split()[1],' '.join(x.split()[2:])

Or:

df['Date_Title'] = (df['Date_Title'].str.split().str[0],
                    df['Date_Title'].str.split().str[1],
                    ' '.join(df['Date_Title'].str.split().str[2:])
Celius Stingher
  • 17,835
  • 6
  • 23
  • 53
1

Focusing on the string split:

string = "21-20 IV. Macbeth"
i = string.index(".")  # Finds the first point
date, roman = string[:i].split() # 21-20, IV
title = string[i+2:]  # Macbeth
frab
  • 1,162
  • 1
  • 4
  • 14
1
df=df.assign(x=df['Date_Title'].str.split('\.').str[0],y=df['Date_Title'].str.extract('(\w+(?=\.))'),z=df['Date_Title'].str.split('\.').str[1:].str.join(','))
wwnde
  • 26,119
  • 6
  • 18
  • 32