I have a regex-related question. I have a variable, nodeName, that I am reading from a .csv file and that can look like any of the following: E1_40873886, E2_40873886, 40873886, 40873886-A, 40873886-B. I could write a long chain of if/elif/else, but I am wondering whether Python's regex support offers a smarter way. Also, I can't hardcode 40873886, as in `if '40873886' in {entry}:`, because the .csv file has a million entries with varying number IDs.

Nikhil Gupta
-
What do you actually want to _do_? – TigerhawkT3 Jul 23 '15 at 23:48
-
@TigerhawkT3 just extract "40873886" portion from above strings. Later I wanna append to an array which I am handling. – Nikhil Gupta Jul 23 '15 at 23:49
-
Like `re.search('\d\d+', 'E1_45612786188a').group(0)`? – TigerhawkT3 Jul 23 '15 at 23:54
-
@TigerhawkT3 I can't hardcode because there are like a million entries in the .csv file and i am temporarily saving in a variable "tempNode". so can something like `re.search('\d\d+', tempNode).group(0)` where tempNode has something that looks like one of the following: **E1_40873886, E2_40873886, 40873886, 40873886-A, 40873886-B** work? – Nikhil Gupta Jul 23 '15 at 23:59
-
You'd replace the `'E1_45612786188a'` with a reference to the cell you're interested in. – TigerhawkT3 Jul 24 '15 at 00:00
-
@NikhilGupta, are there always eight digits? Also how many digits can appear after E1 etc... can there be E12345678_...? – Padraic Cunningham Jul 24 '15 at 00:07
-
To extract one or more digits followed by a word boundary, try [\d+\b](https://regex101.com/r/yH7aM9/1). – Jonny 5 Jul 24 '15 at 04:54
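For anyone who wants to try the `\d+\b` suggestion from the last comment, here is a minimal sketch against the five sample names in the question. The helper name `extract_id` is made up for illustration. Note that `_` counts as a word character, so there is no word boundary after the `1` in `E1_` and that digit is skipped.

```python
import re

# Pattern from the comment: one or more digits followed by a word boundary.
DIGITS_AT_BOUNDARY = re.compile(r"\d+\b")


def extract_id(node_name):
    """Return the first digit run that ends at a word boundary, or None."""
    match = DIGITS_AT_BOUNDARY.search(node_name)
    return match.group(0) if match else None


# The five formats listed in the question.
samples = ["E1_40873886", "E2_40873886", "40873886", "40873886-A", "40873886-B"]
ids = [extract_id(name) for name in samples]
print(ids)
```

Using `search` rather than `match` matters here, because the ID does not always start at the beginning of the string.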
2 Answers
Is this what you're looking for? It captures the run of digits that follows the (optional) prefix ending in an underscore.

    import re

    # Optional "prefix_" part, then the digits we want, then at most one non-digit.
    regex = re.compile(r"(?:[^_]*?_)?(\d*)(?:[^0-9])?")

    # Sample node names
    nodeNames = ["E1_40873886", "E2_40873886", "40873886", "40873886-A", "40873886-B"]
    for nodeName in nodeNames:
        result = regex.match(nodeName)
        print(result.group(1))

AccidentallyC
-
This works. I have a followup question: what if **nodeNames** looks like this: `nodeNames=[nodeNames = ["40873886_A", "40873886_E1", "40873886_E2", "40873886_RES_NC-SC", "E1_40873886", "E2_40873886", "ELB: (A): 40873886_E1", "ELB: (A): 40873886_E2", "Node_40873886", "txug: p-38: 40873886", "40873886-A", "4087388-B"]` – Nikhil Gupta Jul 24 '15 at 17:10
-
Slr, is this like, literally how it looks like? Or are you implying that it's an array in an array? – AccidentallyC Jul 25 '15 at 05:33
-
I mean what if **nodeNames** can look like any of the above but want to extract *40873886* portion of it? – Nikhil Gupta Jul 25 '15 at 22:37
-
Well, you could use Padraic's answer if you can ensure it always has 8 digits. If you can't, you can modify his regex to "\d{3,}", provided the part you're extracting always has at least 3 digits. [pastebin link](http://pastebin.com/hskbKcsx) – AccidentallyC Jul 26 '15 at 01:23
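To make the `\d{3,}` suggestion concrete, here is a rough sketch run against the longer list from the follow-up comment (copied as posted, including the 7-digit last entry). The function name `extract_id` is only illustrative, and the "at least 3 digits" threshold is the assumption stated in the comment, not something the data guarantees.

```python
import re

# First run of three or more consecutive digits anywhere in the string.
LONG_DIGIT_RUN = re.compile(r"\d{3,}")


def extract_id(node_name):
    """Return the first run of 3+ digits in node_name, or None if there is none."""
    match = LONG_DIGIT_RUN.search(node_name)
    return match.group(0) if match else None


# Node names from the follow-up comment, as posted.
nodeNames = [
    "40873886_A", "40873886_E1", "40873886_E2", "40873886_RES_NC-SC",
    "E1_40873886", "E2_40873886", "ELB: (A): 40873886_E1",
    "ELB: (A): 40873886_E2", "Node_40873886", "txug: p-38: 40873886",
    "40873886-A", "4087388-B",
]
ids = [extract_id(name) for name in nodeNames]
print(ids)
```

Short digit runs such as the `1` in `E1` or the `38` in `p-38` are skipped because they are shorter than three digits. If a prefix could ever contain three or more digits, this would pick up the wrong run.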
You can use str.translate to keep just the digits. Splitting on _ gives you either the whole string or just the substring after the last _, from which you can remove any - or uppercase letters:

    from string import ascii_uppercase

    # Python 2: str.translate(None, chars) deletes every character in chars.
    nodeName.split("_")[-1].translate(None, ascii_uppercase + "-")
Output:

    In [44]: nodeName = "E1_40873886"
    In [45]: nodeName.split("_")[-1].translate(None,ascii_uppercase+"-")
    Out[45]: '40873886'
    In [46]: nodeName = "40873886-B"
    In [47]: nodeName.split("_")[-1].translate(None,ascii_uppercase+"-")
    Out[47]: '40873886'
    In [48]: nodeName = "40873886"
    In [49]: nodeName.split("_")[-1].translate(None,ascii_uppercase+"-")
    Out[49]: '40873886'
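The two-argument `translate(None, ...)` call is the Python 2 `str` form. On Python 3 the same idea could be sketched with a deletion table built by `str.maketrans`. The names `DROP_CHARS` and `extract_id` below are made up for illustration.

```python
from string import ascii_uppercase

# Translation table that deletes uppercase ASCII letters and hyphens (Python 3).
DROP_CHARS = str.maketrans("", "", ascii_uppercase + "-")


def extract_id(node_name):
    """Take the part after the last underscore and drop uppercase letters and '-'."""
    return node_name.split("_")[-1].translate(DROP_CHARS)


for name in ["E1_40873886", "40873886-B", "40873886"]:
    print(extract_id(name))
```

Like the original, this only covers names where the ID is the last `_`-separated piece.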
You can also use rstrip instead of translate:

    nodeName.split("_")[-1].rstrip(ascii_uppercase + "-")
If you always have 8 consecutive digits, you could also use a simple regex:

    import re

    nodeName = "E2_40873886"
    print(re.search(r"\d{8}", nodeName).group())
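Since the question mentions a .csv file with around a million rows, here is one way the 8-digit regex might be applied row by row without loading everything at once. This is only a sketch: the column position, the `extract_ids` helper and the in-memory sample data are assumptions, since the real file layout isn't shown.

```python
import csv
import io
import re

# Compile once and reuse the pattern for every row.
EIGHT_DIGITS = re.compile(r"\d{8}")


def extract_ids(rows, column=0):
    """Yield the 8-digit ID found in the given column, skipping rows without one."""
    for row in rows:
        if len(row) <= column:
            continue  # blank or short row
        match = EIGHT_DIGITS.search(row[column])
        if match:
            yield match.group()


# Stand-in for the real file; with a file on disk you would pass
# csv.reader(open(path, newline="")) instead (path is hypothetical).
sample_csv = io.StringIO("E1_40873886,x\n40873886-A,y\nno_id_here,z\n")
ids = list(extract_ids(csv.reader(sample_csv)))
print(ids)
```

Because `extract_ids` is a generator, the IDs can be appended to a list (or processed) one at a time as the file is read.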

Padraic Cunningham
-
@MoMo, it covers all the possible formats in the question so not sure what you mean – Padraic Cunningham Jul 24 '15 at 00:02