Extracting Data with Python Regular Expressions

Question

I am having trouble coming up with a Python regular expression to extract specific values.

The page I am trying to parse has a number of productIds, which appear in the following format:

\"productId\":\"111111\"

I need to extract all the values, 111111 in this case.

Have you read the [documentation on python regular expressions](http://docs.python.org/2/library/re.html)? — Joel Cornett, Apr 11 '13 at 20:39
Is it that you are new to regex, python, or both? Which part do you need help with? What have you tried? — cmd, Apr 11 '13 at 20:41
Possible duplicate of [how to extract a substring from inside a string in Python?](http://stackoverflow.com/questions/4666973/how-to-extract-a-substring-from-inside-a-string-in-python) — Андрей Беньковский, Nov 17 '15 at 17:11

score 35 · Accepted Answer · answered Apr 11 '13 at 20:54

import re

t = "\"productId\":\"111111\""
# Raw string so that \W, \D and \d reach the regex engine unchanged
m = re.match(r"\W*productId[^:]*:\D*(\d+)", t)
if m:
    print(m.group(1))

This matches any leading non-word characters (\W*), then productId followed by any non-colon characters ([^:]*) and a colon. It then skips non-digits (\D*) and captures the digits that follow ((\d+)).
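Note that re.match only tries the pattern at the start of the string and returns a single match, while the question asks for all values. A minimal sketch of the same pattern idea with re.findall, assuming the page text contains literal backslashes as shown in the question (the sample fragment and the second ID are invented for illustration):

```python
import re

# Hypothetical page fragment; the second ID is invented for illustration.
page = r'{\"productId\":\"111111\"},{\"productId\":\"222222\"}'

# findall returns the captured group for every non-overlapping match,
# wherever it occurs in the text.
ids = re.findall(r'productId[^:]*:\D*(\d+)', page)
print(ids)
```

Dropping the leading \W* is fine here because findall searches the whole string rather than anchoring at its start.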

Output:

111111

perreal


Does this not need to be a raw string, or to have the backslashes escaped? – Tim MB Oct 21 '21 at 10:50

score 15 · Answer 2 · answered Apr 11 '13 at 20:40

Something like this:

In [13]: s=r'\"productId\":\"111111\"'

In [14]: print(s)
\"productId\":\"111111\"

In [15]: import re

In [16]: re.findall(r'\d+', s)
Out[16]: ['111111']
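One caveat: \d+ on its own picks up every run of digits in the text, not only product IDs. A rough sketch of the difference, using a made-up price field that is not part of the original question:

```python
import re

# Hypothetical input; the "price" field is invented for illustration.
s = r'\"productId\":\"111111\",\"price\":\"25\"'

# A bare \d+ returns every run of digits in the string.
all_digits = re.findall(r'\d+', s)

# Anchoring on the key name keeps only digits that follow productId.
product_ids = re.findall(r'productId\D*(\d+)', s)

print(all_digits)
print(product_ids)
```

If the page really contains nothing but product IDs, the shorter pattern is enough.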

Fredrik Pihl


I find this more Pythonic. :) – skytreader Jun 02 '15 at 08:20

score 2 · Answer 3 · answered Apr 11 '13 at 20:43

The backslashes here might add to the confusion, because they are used as an escape character both by (non-raw) Python strings and by the regexp syntax.

This extracts the product ids from the format you posted:

import re

re_prodId = re.compile(r'\\"productId\\":\\"([^"]+)\\"')

The raw string r'...' does away with one level of backslash escaping; using a single quote as the string delimiter does away with the need to escape the double quotes; and finally the backslashes are doubled (only once) because the backslash has a special meaning in regexp syntax.
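To make the two escaping levels concrete, here is a small sketch comparing the raw-string pattern with its non-raw equivalent. The variable names are only for illustration:

```python
import re

# Raw string: each \\ is two characters, which the regex engine
# reads as one literal backslash.
raw_pattern = r'\\"productId\\":\\"([^"]+)\\"'

# Without the r prefix, every backslash must be doubled again
# for the Python string parser.
plain_pattern = '\\\\"productId\\\\":\\\\"([^"]+)\\\\"'

# Both spellings produce exactly the same pattern text.
same = raw_pattern == plain_pattern

text = r'\"productId\":\"111111\"'
found = re.search(raw_pattern, text).group(1)
print(same, found)
```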

You can use the regexp object's findall() method to find all matches in some text:

re_prodId.findall(text_to_search)

This will return a list of all product ids.
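Put together, a self-contained sketch might look like the following. The contents of text_to_search are invented here and assume the page holds literal backslashes as in the question:

```python
import re

re_prodId = re.compile(r'\\"productId\\":\\"([^"]+)\\"')

# Hypothetical page fragment with two product entries.
text_to_search = r'[{\"productId\":\"111111\"},{\"productId\":\"222222\"}]'

# With one capture group, findall returns a list of the captured strings.
product_ids = re_prodId.findall(text_to_search)
print(product_ids)
```

Because the group is [^"]+ rather than \d+, this would also capture IDs that are not purely numeric.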

score 0 · Answer 4 · answered Apr 11 '13 at 20:40

Try this:

 :\\"(\d*)\\"

Give more examples of your data if this doesn't do what you want.
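For reference, a sketch of how this pattern might be used from Python. It matches any quoted numeric value that follows a colon, so the made-up qty field below (not in the original question) is picked up as well; prefixing the key name is one way to narrow it:

```python
import re

# Hypothetical input; the "qty" field is invented for illustration.
s = r'\"productId\":\"111111\",\"qty\":\"3\"'

# The pattern as posted: any digits between :\" and \".
any_value = re.findall(r':\\"(\d*)\\"', s)

# Narrowed variant that requires the productId key before the colon.
only_ids = re.findall(r'productId\\":\\"(\d+)\\"', s)

print(any_value)
print(only_ids)
```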

frickskit

