Ignore all characters and just read numbers portion from the string

Question

I have a regex-related question. I have a variable, nodeName, that I am reading from a .csv file and that can look like any of the following: E1_40873886, E2_40873886, 40873886, 40873886-A, 40873886-B. I could write a long chain of if/elif/else, but I am wondering whether Python's regex support offers a smarter way to do it. Also, I can't hardcode 40873886, as in if '40873886' in {entry}:, because the .csv file has a million entries with varying number IDs.

@TigerhawkT3 just extract "40873886" portion from above strings. Later I wanna append to an array which I am handling. — Nikhil Gupta, Jul 23 '15 at 23:49
@TigerhawkT3 I can't hardcode because there are like a million entries in the .csv file and i am temporarily saving in a variable "tempNode". so can something like `re.search('\d\d+', tempNode).group(0)` where tempNode has something that looks like one of the following: **E1_40873886, E2_40873886, 40873886, 40873886-A, 40873886-B** work? — Nikhil Gupta, Jul 23 '15 at 23:59
You'd replace the `'E1_45612786188a'` with a reference to the cell you're interested in. — TigerhawkT3, Jul 24 '15 at 00:00
@NikhilGupta, are there always eight digits? Also how many digits can appear after E1 etc... can there be E12345678_...? — Padraic Cunningham, Jul 24 '15 at 00:07
To extract one or more digits followed by a word boundary, try [\d+\b](https://regex101.com/r/yH7aM9/1). — Jonny 5, Jul 24 '15 at 04:54

score 2 · Answer 1 · answered Jul 24 '15 at 01:32

Is this what you're looking for? It extracts every digit after the (optional) underscore.

import re
# Optional prefix up to the first underscore, then the digits to capture.
regex = re.compile(r"(?:[^_]*?_)?(\d*)(?:[^0-9])?")
# Sample node names
nodeNames = ["E1_40873886", "E2_40873886", "40873886", "40873886-A", "40873886-B"]
for nodeName in nodeNames:
    result = regex.match(nodeName)
    print(result.group(1))

AccidentallyC


This works. I have a followup question: what if **nodeNames** looks like this: `nodeNames=[nodeNames = ["40873886_A", "40873886_E1", "40873886_E2", "40873886_RES_NC-SC", "E1_40873886", "E2_40873886", "ELB: (A): 40873886_E1", "ELB: (A): 40873886_E2", "Node_40873886", "txug: p-38: 40873886", "40873886-A", "4087388-B"]` – Nikhil Gupta Jul 24 '15 at 17:10
Slr, is this like, literally how it looks like? Or are you implying that it's an array in an array? – AccidentallyC Jul 25 '15 at 05:33
I mean what if **nodeNames** can look like any of the above but want to extract *40873886* portion of it? – Nikhil Gupta Jul 25 '15 at 22:37
Well you could use Padraic's answer above, if u can ensure it always has 8 digits, if you can't assure it, you can modify his regex into "\d{3,}" that is if you can ensure the thing ur extracting has atleast 3 digits. [pastebin link](http://pastebin.com/hskbKcsx) – AccidentallyC Jul 26 '15 at 01:23
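A minimal sketch of the `\d{3,}` suggestion from the comment above, applied to the follow-up list of names. The helper name `extract_id` and the `min_digits` parameter are illustrative and not from the thread; the sketch assumes the ID is the first run of at least three digits in each string.

```python
import re

def extract_id(node_name, min_digits=3):
    """Return the first run of at least `min_digits` digits, or None if there is none."""
    match = re.search(r"\d{%d,}" % min_digits, node_name)
    return match.group() if match else None

# Names taken from the follow-up comment.
node_names = [
    "40873886_A", "40873886_E1", "40873886_E2", "40873886_RES_NC-SC",
    "E1_40873886", "E2_40873886", "ELB: (A): 40873886_E1",
    "ELB: (A): 40873886_E2", "Node_40873886", "txug: p-38: 40873886",
    "40873886-A", "4087388-B",
]

ids = [extract_id(name) for name in node_names]
print(ids)
```

Short digit runs such as the `1` in `E1` or the `38` in `p-38` are skipped because they fall below the three-digit minimum; if real IDs can be shorter than that, this threshold would need to change.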

Padraic Cunningham · Accepted Answer · 2015-07-24T00:14:21.373

You can split on _, which gives you either the whole string or just the substring after the last _, and then use str.translate to remove any trailing - or uppercase letters (the two-argument translate(None, ...) form shown here is Python 2 only):

from string import ascii_uppercase
nodeName.split("_")[-1].translate(None,ascii_uppercase+"-")

Output:

In [44]: nodeName = "E1_40873886"

In [45]: nodeName.split("_")[-1].translate(None,ascii_uppercase+"-")
Out[45]: '40873886'

In [46]: nodeName = "40873886-B"

In [47]: nodeName.split("_")[-1].translate(None,ascii_uppercase+"-")
Out[47]: '40873886'

In [48]: nodeName = "40873886"

In [49]: nodeName.split("_")[-1].translate(None,ascii_uppercase+"-")
Out[49]: '40873886'
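On Python 3 the two-argument `translate(None, ...)` call raises a `TypeError`, because `str.translate` takes a single mapping table. A minimal sketch of an equivalent, assuming the same input formats as in the question; the names `_DROP` and `digits_after_underscore` are illustrative and not part of the answer.

```python
from string import ascii_uppercase

# Translation table that deletes uppercase ASCII letters and hyphens.
_DROP = str.maketrans("", "", ascii_uppercase + "-")

def digits_after_underscore(node_name):
    """Keep the part after the last underscore, minus uppercase letters and '-'."""
    return node_name.split("_")[-1].translate(_DROP)

for name in ["E1_40873886", "40873886-B", "40873886"]:
    print(digits_after_underscore(name))
```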

You can also rstrip instead of translate:

nodeName.split("_")[-1].rstrip(ascii_uppercase+"-")

If you always have 8 consecutive digits you could also use a simple regex:

import re
nodeName = "E2_40873886"
# Raw string so that \d is passed to the regex engine unchanged.
print(re.search(r"\d{8}", nodeName).group())
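Since the question mentions reading the names from a .csv file and appending the IDs to an array, here is a rough sketch of how the eight-digit regex might be applied row by row. The column header `nodeName`, the function name `collect_ids`, and the in-memory sample file are assumptions made for illustration. The lookarounds are an addition to the pattern above so that eight digits inside a longer number are not matched.

```python
import csv
import io
import re

# Exactly eight digits that are not part of a longer digit run
# (an assumption about the ID format).
ID_PATTERN = re.compile(r"(?<!\d)\d{8}(?!\d)")

def collect_ids(csv_file, column="nodeName"):
    """Collect the eight-digit ID from each row; rows without one are skipped."""
    ids = []
    for row in csv.DictReader(csv_file):
        match = ID_PATTERN.search(row[column])
        if match:
            ids.append(match.group())
    return ids

# Hypothetical sample data standing in for the real .csv file.
sample = io.StringIO("nodeName\nE1_40873886\n40873886-A\nno-id-here\n")
print(collect_ids(sample))
```

Checking for `None` before calling `.group()` matters with a million rows, because a single entry without a match would otherwise raise an `AttributeError`.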

@MoMo, it covers all the possible formats in the question so not sure what you mean — Padraic Cunningham, Jul 24 '15 at 00:02
