I have a regex-related question. I have a variable, nodeName, that I am reading from a .csv file and that can look like any of the following: E1_40873886, E2_40873886, 40873886, 40873886-A, 40873886-B. I could write a long chain of if/elif/else, but I am wondering whether Python's regex support offers a smarter way to do it. Also, I can't hardcode 40873886, as in `if '40873886' in {entry}:`, because the .csv file has a million entries with varying number IDs.

Nikhil Gupta
  • What do you actually want to _do_? – TigerhawkT3 Jul 23 '15 at 23:48
  • @TigerhawkT3 just extract "40873886" portion from above strings. Later I wanna append to an array which I am handling. – Nikhil Gupta Jul 23 '15 at 23:49
  • Like `re.search('\d\d+', 'E1_45612786188a').group(0)`? – TigerhawkT3 Jul 23 '15 at 23:54
  • @TigerhawkT3 I can't hardcode it because there are about a million entries in the .csv file, and I am temporarily saving each one in a variable, "tempNode". So would something like `re.search('\d\d+', tempNode).group(0)` work, where tempNode looks like one of the following: **E1_40873886, E2_40873886, 40873886, 40873886-A, 40873886-B**? – Nikhil Gupta Jul 23 '15 at 23:59
  • You'd replace the `'E1_45612786188a'` with a reference to the cell you're interested in. – TigerhawkT3 Jul 24 '15 at 00:00
  • @NikhilGupta, are there always eight digits? Also how many digits can appear after E1 etc... can there be E12345678_...? – Padraic Cunningham Jul 24 '15 at 00:07
  • To extract one or more digits followed by a word boundary, try [\d+\b](https://regex101.com/r/yH7aM9/1). – Jonny 5 Jul 24 '15 at 04:54
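
For the CSV use case discussed in these comments, a minimal sketch might compile the pattern once and apply it to every row. The file layout (node name in the first column), the helper name `extract_id`, and the in-memory sample data are assumptions for illustration; none of them come from the question.

```python
import csv
import io
import re

# Runs of two or more digits, as in the `\d\d+` suggestion above.
DIGITS = re.compile(r"\d\d+")

def extract_id(node_name):
    """Return the first run of 2+ digits in node_name, or None if there is none."""
    match = DIGITS.search(node_name)
    return match.group(0) if match else None

# Hypothetical stand-in for the real .csv file; the node name is assumed
# to sit in the first column of each row.
sample = io.StringIO("E1_40873886\nE2_40873886\n40873886\n40873886-A\n40873886-B\n")

ids = [extract_id(row[0]) for row in csv.reader(sample) if row]
print(ids)
```

Returning None for a string without digits avoids the AttributeError that calling `.group(0)` directly on a failed `re.search` would raise.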

2 Answers

Is this what you're looking for? It extracts every digit after the (optional) underscore.

import re

# Skip an optional prefix ending in "_", then capture the run of digits that follows.
regex = re.compile(r"(?:[^_]*?_)?(\d*)(?:[^0-9])?")
# Sample node names
nodeNames = ["E1_40873886", "E2_40873886", "40873886", "40873886-A", "40873886-B"]
for nodeName in nodeNames:
    result = regex.match(nodeName)
    print(result.group(1))
AccidentallyC
  • This works. I have a followup question: what if **nodeNames** looks like this: `nodeNames=[nodeNames = ["40873886_A", "40873886_E1", "40873886_E2", "40873886_RES_NC-SC", "E1_40873886", "E2_40873886", "ELB: (A): 40873886_E1", "ELB: (A): 40873886_E2", "Node_40873886", "txug: p-38: 40873886", "40873886-A", "4087388-B"]` – Nikhil Gupta Jul 24 '15 at 17:10
  • Slr, is this like, literally how it looks like? Or are you implying that it's an array in an array? – AccidentallyC Jul 25 '15 at 05:33
  • I mean what if **nodeNames** can look like any of the above but want to extract *40873886* portion of it? – Nikhil Gupta Jul 25 '15 at 22:37
  • Well, you could use Padraic's answer if you can ensure the ID always has 8 digits. If you can't, you can modify his regex into "\d{3,}", provided you can ensure the thing you're extracting has at least 3 digits. [pastebin link](http://pastebin.com/hskbKcsx) – AccidentallyC Jul 26 '15 at 01:23
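
As a rough illustration of the `\d{3,}` suggestion, the sketch below runs it over the longer list from the follow-up comment. It assumes the wanted ID is the first run of three or more digits in each string; the names `ID_PATTERN`, `node_names`, and `extracted` are made up for this example.

```python
import re

# At least three consecutive digits; shorter runs such as the "38" in
# "p-38" or the "1" in "E1" are skipped.
ID_PATTERN = re.compile(r"\d{3,}")

node_names = [
    "40873886_A", "40873886_E1", "40873886_E2", "40873886_RES_NC-SC",
    "E1_40873886", "E2_40873886", "ELB: (A): 40873886_E1",
    "ELB: (A): 40873886_E2", "Node_40873886", "txug: p-38: 40873886",
    "40873886-A", "4087388-B",
]

# search() scans the whole string, so the ID may appear anywhere in it.
extracted = [ID_PATTERN.search(name).group() for name in node_names]
print(extracted)
```

The last entry yields the seven-digit `4087388`, since that is what the sample string contains.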
In Python 2, you can use str.translate to keep just the digits. Splitting on _ gives you either the whole string or the substring after the last _, and from that you can remove any - or uppercase letters:

from string import ascii_uppercase
nodeName.split("_")[-1].translate(None,ascii_uppercase+"-")

Output:

In [44]: nodeName = "E1_40873886"

In [45]: nodeName.split("_")[-1].translate(None,ascii_uppercase+"-")
Out[45]: '40873886'

In [46]: nodeName = "40873886-B"

In [47]: nodeName.split("_")[-1].translate(None,ascii_uppercase+"-")
Out[47]: '40873886'

In [48]: nodeName = "40873886"

In [49]: nodeName.split("_")[-1].translate(None,ascii_uppercase+"-")
Out[49]: '40873886'
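
The `translate(None, ...)` call shown here uses the Python 2 signature. On Python 3, `str.translate` takes a single table argument, so a rough equivalent would build that table with `str.maketrans`. The helper name `strip_node_name` is hypothetical.

```python
from string import ascii_uppercase

# Python 3: the third argument to str.maketrans lists characters to delete.
DELETE_TABLE = str.maketrans("", "", ascii_uppercase + "-")

def strip_node_name(node_name):
    """Keep the part after the last "_" and delete uppercase letters and hyphens."""
    return node_name.split("_")[-1].translate(DELETE_TABLE)

for name in ["E1_40873886", "40873886-B", "40873886"]:
    print(strip_node_name(name))
```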

You can also use rstrip instead of translate, which strips only trailing uppercase letters and hyphens:

nodeName.split("_")[-1].rstrip(ascii_uppercase+"-")

If you always have 8 consecutive digits you could also use a simple regex:

import re

nodeName = "E2_40873886"
print(re.search(r"\d{8}", nodeName).group())
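
One caveat with a bare `\d{8}` is that it also matches the first eight digits of a longer digit run. If that matters for your data, a possible variant adds lookarounds so that only runs of exactly eight digits match, and returns None instead of raising when nothing matches. The helper name `find_eight_digit_id` and the 13-digit test string are assumptions for illustration.

```python
import re

# Exactly eight digits, with no digit immediately before or after.
EIGHT_DIGITS = re.compile(r"(?<!\d)\d{8}(?!\d)")

def find_eight_digit_id(node_name):
    """Return the first standalone 8-digit run, or None if there is none."""
    match = EIGHT_DIGITS.search(node_name)
    return match.group() if match else None

print(find_eight_digit_id("E2_40873886"))
print(find_eight_digit_id("4087388612345"))  # 13-digit run, so no match
print(re.search(r"\d{8}", "4087388612345").group())  # plain \d{8} still matches
```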
Padraic Cunningham