I am trying to extract a numeric string from text using Python. Example: "大田区大森北3−24−27ルミエールN103". I only want '3-24-27' from a column in a DataFrame (df). I tried this, but the error says invalid syntax. I am currently working with Japanese text, but I need this for other languages as well. I am new to Python and would appreciate some help. Thanks.
- Please show what code you used. – Wiktor Stribiżew Jul 11 '18 at 13:34
- Is the numeric string always in the format #-##-##? For instance, you also have 103 at the end of the string, but don't seem to want that. So how do you decide which numeric characters are the ones you really want? – ALollz Jul 11 '18 at 13:35
- Try adding `# -*- coding: utf-8 -*-` to the top of your file, before the imports. Please also post your full code so we can figure out the answer. – Mohammed Abuiriban Jul 11 '18 at 13:38
- Welcome to Stack Overflow! Please review the [guide to asking good questions](https://stackoverflow.com/help/how-to-ask). You should post an example of your code and errors. Since this is a syntax error, seeing your actual code would help check for typos or other issues. – Logan Bertram Jul 11 '18 at 13:46
2 Answers
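Use str.extract with a capture group.
Example:
import pandas as pd
df = pd.DataFrame({"a": ["大田区大森北3−24−27ルミエールN103"]})
# expand=False returns a Series rather than a one-column DataFrame
print(df["a"].str.extract(r"(\d+−\d+−\d+)", expand=False))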
Output:
0 3−24−27
Name: a, dtype: object
- Note: I have used − (the character that appears in your text), not the minus symbol on the keyboard (-).
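Since the question asks for '3-24-27' with a plain hyphen and mentions other languages, here is a minimal sketch that accepts several dash-like separators and normalizes them afterwards. The set of characters in DASHES, the names DASHES and PATTERN, and the "block" column are assumptions made for illustration, not part of the original answer; adjust the character class to whatever separators actually occur in your data.

```python
import pandas as pd

# Assumption: dash-like characters to accept as separators
# (ASCII hyphen-minus, U+2010..U+2015, U+2212 minus sign, U+FF0D fullwidth hyphen-minus).
DASHES = r"[-\u2010-\u2015\u2212\uFF0D]"
PATTERN = rf"(\d+{DASHES}\d+{DASHES}\d+)"

df = pd.DataFrame({"a": ["大田区大森北3−24−27ルミエールN103", "no numbers here"]})

# expand=False returns a Series; rows without a match become NaN.
# The second step rewrites whichever dash was matched as an ASCII "-".
df["block"] = (
    df["a"]
    .str.extract(PATTERN, expand=False)
    .str.replace(DASHES, "-", regex=True)
)

print(df["block"].tolist())
```

The trailing 103 is not picked up because it is not part of a digits-dash-digits-dash-digits run.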

Rakesh
Please accept the answer if it solved your problem (the tick symbol next to the answer). Thanks – Rakesh Jul 11 '18 at 15:09
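You can do that using only the standard library's re module:
import re
pattern = r'(\d+−\d+−\d+)'  # raw string, so \d reaches the regex engine unchanged
text = '大田区大森北3−24−27ルミエールN103'
result = re.search(pattern, text)  # returns None when nothing matches
print(result.group(0))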
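The pattern uses '\d+' to match runs of digits and '−' as the separator, exactly as it appears in your example. Note that this is not the ASCII hyphen '-', so a pattern written with '-' would not match this text.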
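One thing to watch for: re.search returns None when a string has no match, and calling .group(0) on None raises an AttributeError. The following is a small sketch of a guard around the same pattern; the function name first_block_number and the samples list are hypothetical, introduced only for this example.

```python
import re
from typing import Optional

# Same pattern as above; the separator is the "−" from the sample text,
# not the ASCII hyphen "-".
PATTERN = re.compile(r"\d+−\d+−\d+")


def first_block_number(text: str) -> Optional[str]:
    """Return the first digits−digits−digits run in text, or None if there is none."""
    match = PATTERN.search(text)
    return match.group(0) if match else None


samples = ["大田区大森北3−24−27ルミエールN103", "ルミエールN103"]
results = [first_block_number(s) for s in samples]
print(results)
```

A function like this can also be applied to a DataFrame column with df["a"].map(first_block_number) if you prefer to stay with the re module.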

pafreire