
I am trying to extract a numeric string from text using Python. For example, from "大田区大森北3−24−27ルミエールN103 " I only want '3-24-27'. The text is in a column of a DataFrame. I tried this, but the error says invalid syntax. I am currently working with Japanese text, but I need this for other languages as well. I am new to Python and would appreciate some help. Thanks.

kIRTI
  • Please show what code you used. – Wiktor Stribiżew Jul 11 '18 at 13:34
  • Is the numeric string always in the format #-##-##? For instance, you also have 103 at the end of the string, but you don't seem to want that. So how do you decide which numeric characters are the ones you really want? – ALollz Jul 11 '18 at 13:35
  • Try adding `# -*- coding: utf-8 -*-` to the top of your file, before the imports. Please also post your full code so we can figure out the answer. – Mohammed Abuiriban Jul 11 '18 at 13:38
  • Welcome to Stack Overflow! Please review the [guide to asking good questions](https://stackoverflow.com/help/how-to-ask). You should post an example of your code and errors. Since this is a syntax error, seeing your actual code would help check for typos or other issues. – Logan Bertram Jul 11 '18 at 13:46

2 Answers


Using str.extract

Example:

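import pandas as pd
df = pd.DataFrame({"a": ["大田区大森北3−24−27ルミエールN103"]})
# expand=False returns a Series (matching the output below) instead of a one-column DataFrame
print(df["a"].str.extract(r"(\d+−\d+−\d+)", expand=False))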

Output:

0    3−24−27
Name: a, dtype: object
  • Note: the pattern uses the minus sign (−) copied from your sample text, not the hyphen on the keyboard (-).
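Since the question asks for '3-24-27' with a plain hyphen and mentions other languages, here is a minimal sketch of one way to accept several dash-like separators and normalize them afterwards. The `DASHES` list, the `block` column name, and the second sample row are assumptions made for illustration, not part of the original answer, and the list of dash characters is not exhaustive.

```python
import pandas as pd

# Assumed, non-exhaustive set of separators: ASCII hyphen-minus, U+2010 hyphen,
# U+2013 en dash, U+2212 minus sign, U+FF0D fullwidth hyphen-minus.
# The ASCII hyphen comes first so it is literal inside the character class.
DASHES = "-\u2010\u2013\u2212\uFF0D"
PATTERN = rf"(\d+[{DASHES}]\d+[{DASHES}]\d+)"

df = pd.DataFrame({"a": ["大田区大森北3−24−27ルミエールN103", "no numbers here"]})

# expand=False yields a Series; rows without a match become NaN
extracted = df["a"].str.extract(PATTERN, expand=False)

# Replace whichever dash variant was matched with a plain ASCII hyphen
df["block"] = extracted.str.replace(f"[{DASHES}]", "-", regex=True)

print(df["block"].tolist())
```

Rows that do not contain the pattern stay missing rather than raising an error, so they can be filtered later with `df["block"].notna()`.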
Rakesh

You can do this with only the standard library's re module:

import re

# Raw string so that \d reaches the regex engine unchanged
pattern = r'(\d+−\d+−\d+)'
text = '大田区大森北3−24−27ルミエールN103'
result = re.search(pattern, text)
print(result.group(0))

The pattern uses '\d+' to match runs of digits and '−' (the minus sign from your example, not the ASCII hyphen '-') as the separator.
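One thing to watch for: `re.search` returns `None` when nothing matches, so calling `.group(0)` directly would raise an `AttributeError` on such text. A small sketch that guards against this follows; the helper name `extract_block` is hypothetical, and the separator is written as `\u2212` on the assumption that the text uses the U+2212 minus sign.

```python
import re

# \u2212 is the minus sign (−); written as an escape so the character is unambiguous
PATTERN = re.compile(r"\d+\u2212\d+\u2212\d+")

def extract_block(text):
    """Return the first digits−digits−digits run in text, or None if absent."""
    match = PATTERN.search(text)
    return match.group(0) if match else None

print(extract_block("大田区大森北3−24−27ルミエールN103"))  # 3−24−27
print(extract_block("ルミエールN103"))                    # None
```

For a DataFrame column, the same helper could be applied with `df["a"].map(extract_block)`.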

pafreire