I'm trying to create a year column with the year taken from the title column in my dataframe. This code works, but the column dtype is object. For example, in row 1 the year displays as [2013].

How can I do this so that the column dtype is float instead?

import re

year_list = []

for i in range(len(wine)):
    # re.findall returns a list of every match, e.g. ['2013']
    year = re.findall(r'\d{4}', wine['title'][i])
    year_list.append(year)

wine['year'] = year_list

Here is the head of my dataframe:

country  designation   points  province  title                      year
Italy    Vulkà Bianco  87      Sicily    Nicosia 2013 Vulkà Bianco  [2013]
Wiktor Stribiżew
GuyGuyGuy

2 Answers

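re.findall returns a list of all matches, which is why each cell holds a list such as [2013]. Use re.search to get only the first match. Note that this raises a TypeError for any title without a four-digit number, because re.search returns None in that case.

wine['year'] = [re.search(r'\d{4}', title)[0] for title in wine['title']]

Better yet, use the pandas str.extract method. The pattern must contain a capturing group, and expand=False returns a Series rather than a one-column DataFrame.

wine['year'] = wine['title'].str.extract(r'(\d{4})', expand=False)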

Definition

Series.str.extract(pat, flags=0, expand=True)

For each subject string in the Series, extract groups from the first match of regular expression pat.
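Since the question asks for a float column, here is a minimal sketch of one way to get there. str.extract returns strings, so a conversion step is still needed. The sample DataFrame is made up for illustration (the second title deliberately has no year), and astype(float) is just one option for the conversion.

```python
import pandas as pd

# Hypothetical sample data; the second title has no year on purpose.
wine = pd.DataFrame(
    {"title": ["Nicosia 2013 Vulkà Bianco", "Some Non-Vintage Blend"]}
)

# expand=False returns a Series; rows without a match become missing values.
year_text = wine["title"].str.extract(r"(\d{4})", expand=False)

# Convert the extracted strings to floats; missing values become NaN.
wine["year"] = year_text.astype(float)

print(wine["year"].dtype)
print(wine)
```

A float column can hold NaN, so titles without a year do not break the conversion.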

Jab

Instead of re.findall, which returns a list of strings, you can use str.extract():

wine['year'] = wine['title'].str.extract(r'\b(\d{4})\b')

Or, if you only want to match years in the 1900s and 2000s:

wine['year'] = wine['title'].str.extract(r'\b((?:19|20)\d{2})\b')

Note that the pattern passed to str.extract must contain at least one capturing group; its value is used to populate the new column. Only the first match is considered, so you may need to make the pattern more specific if a title can contain several four-digit numbers.

I suggest using word boundaries \b around the \d{4} pattern to match 4-digit chunks as whole words and avoid partial matches in strings like 1234567890.
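To see what the word boundaries change, here is a small comparison sketch. The titles are invented for illustration, and the variable names (loose, bounded, modern) are my own labels, not anything from pandas.

```python
import pandas as pd

# Hypothetical titles: a normal one, one with a long digit run,
# and one with two four-digit numbers.
titles = pd.Series(
    [
        "Nicosia 2013 Vulkà Bianco",
        "Lot 1234567890 Reserve",
        "Cuvée 1887 and 1998",
    ]
)

# No boundaries: grabs the first four digits of any digit run.
loose = titles.str.extract(r"(\d{4})", expand=False)

# Word boundaries: only standalone four-digit numbers match.
bounded = titles.str.extract(r"\b(\d{4})\b", expand=False)

# Restricted to numbers starting with 19 or 20.
modern = titles.str.extract(r"\b((?:19|20)\d{2})\b", expand=False)

print(pd.DataFrame({"loose": loose, "bounded": bounded, "modern": modern}))
```

In the second title the unbounded pattern picks up 1234 from the long number, while both bounded patterns leave that row empty. In the third title the restricted pattern skips 1887 and takes 1998.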

Wiktor Stribiżew
  • @GuyGuyGuy Just in case you want to only match 1900-2000s years you may use `r'\b((?:19|20)\d{2})\b'`. – Wiktor Stribiżew Mar 01 '19 at 22:57
  • Just one more follow-up question: what does the "r" mean inside of extract(r'\b(\d{4})\b')? – GuyGuyGuy Mar 01 '19 at 22:58
  • Idk if you had it first or not but i added this and didn't see your answer – Jab Mar 01 '19 at 23:00
  • @GuyGuyGuy This is a prefix that defines [*raw string literals*](https://stackoverflow.com/a/2241618/3832970). In these literals, string escape sequences are not processed: an `r'\n'` string literal stands for a two-character string, ``\`` and `n`. If you do not use `r` with `"\b((?:19|20)\d{2})\b"`, `\b` will be parsed as a BACKSPACE character, not a word boundary. You would need to write it as `"\\b((?:19|20)\\d{2})\\b"`, or as `"\\b((?:19|20)\d{2})\\b"` (since `\d` is not a valid string escape sequence, it may be written with a single backslash, although recent Python versions warn about such unrecognized escapes). – Wiktor Stribiżew Mar 01 '19 at 23:00
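
The raw-string point in the last comment can be checked directly with a short sketch. The variable names and the sample title are only for illustration.

```python
import re

raw_pattern = r"\b(\d{4})\b"          # raw string: backslashes kept as-is
escaped_pattern = "\\b(\\d{4})\\b"    # same pattern with doubled backslashes
backspace_pattern = "\b(\\d{4})\b"    # "\b" here is the BACKSPACE character

title = "Nicosia 2013 Vulkà Bianco"

# The raw and the fully escaped spellings are the same string.
same_string = raw_pattern == escaped_pattern

# With a real word boundary the year is found.
found = re.search(raw_pattern, title).group(1)

# With backspace characters in the pattern nothing matches,
# because the title contains no backspace characters.
not_found = re.search(backspace_pattern, title)

print(same_string, found, not_found)
```

The third pattern is still a valid regular expression; it simply looks for literal backspace characters around the digits, which is why it silently returns no match instead of raising an error.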