I am trying to extract the author name from a page on the website "pixiv". The relevant part of the page source is:

<meta property="og:title" content="ラララ | かるは [pixiv]">

I want to extract "かるは", so I use this regex:

[\u0800-\u9fa5_a-zA-Z0-9_]+(?=\s\[pixiv\])

However, in Python the search does not return a match. (P.S. websiteCode is the source code of the page. I have printed it out and confirmed that it is correct. Specifically, it contains

<meta property="og:title" content="ラララ | かるは [pixiv]">

as expected.)

Here is my Python code:

authorPattern = re.compile(r'[\u0800-\u9fa5_a-zA-Z0-9_]+(?=\s\[pixiv\])')
tempAuthor = re.search(authorPattern, websiteCode)
print("temp: ", tempAuthor)

The output is:

Traceback (most recent call last):
  File "/Users/ChinYuer/Software-Engineering/Pixiv-Spider/pixiv.py", line 191, in <module>
    my.grab_image()
  File "/Users/ChinYuer/Software-Engineering/Pixiv-Spider/pixiv.py", line 84, in grab_image
    testAuthor = tempAuthor.group()
AttributeError: 'NoneType' object has no attribute 'group'

I tried my regex on some online regex testers and it worked fine.

This is really frustrating, and I would really appreciate it if anyone could help me out.

Thanks in advance!

2 Answers

Assuming your code is written for Python 3, it works correctly on Python 3.3 and above, and fails with the same error message on Python 3.2.x and below.

Solution

The simplest solution is to run your code on Python 3.3 or above, and to add a version guard so that older versions of Python refuse to run it.
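A version guard could look roughly like the sketch below. The names MIN_VERSION and check_python_version are made up for illustration and are not part of the question's code; the 3.3 threshold is the version from which re understands \uXXXX inside a pattern.

```python
import sys

# Hypothetical minimum version for this script: the re module only
# understands \uXXXX escapes inside patterns from Python 3.3 onward.
MIN_VERSION = (3, 3)


def check_python_version(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True if the given interpreter version is new enough."""
    return tuple(version_info[:2]) >= minimum


# Stop early with a readable message instead of failing later with
# "AttributeError: 'NoneType' object has no attribute 'group'".
if not check_python_version():
    sys.exit("This script requires Python %d.%d or later." % MIN_VERSION)
```

Placing the check at the top of the script makes the failure obvious on an old interpreter, rather than showing up as a regex that silently matches nothing.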

The second solution is to use a regular (non-raw) string literal, in which Unicode escape sequences are recognized and processed by Python itself. The drawback of this method is that you have to watch the other escape sequences and double the \ where necessary, especially for \b, which a regular string literal turns into a backspace character before the string ever reaches re.compile.

# Python 3.2.5 (default, Jul 25 2014, 14:13:17)
>>> print('[\u0800-\u9fa5_a-zA-Z0-9_]+(?=\s\[pixiv\])')
[ࠀ-龥_a-zA-Z0-9_]+(?=\s\[pixiv\])

>>> import re
>>> re.compile('[\u0800-\u9fa5_a-zA-Z0-9_]+(?=\s\[pixiv\])', re.DEBUG)
max_repeat 1 4294967295
  in
    range (2048, 40869)
    literal 95
    range (97, 122)
    range (65, 90)
    range (48, 57)
    literal 95
assert 1
  in
    category category_space
  literal 91
  literal 112
  literal 105
  literal 120
  literal 105
  literal 118
  literal 93
<_sre.SRE_Pattern object at 0x6001fad70>
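To make the \b pitfall concrete, here is a small sketch comparing the three ways of writing the same pattern. The variable names and the sample text are only for illustration.

```python
import re

# In a regular string literal, "\b" is the backspace character (U+0008).
plain = '\bpixiv\b'
# In a raw string literal the backslash survives, so re sees a word boundary.
raw = r'\bpixiv\b'
# Doubling the backslash in a regular literal yields the same text as the raw one.
doubled = '\\bpixiv\\b'

text = 'かるは [pixiv]'

print(len(plain), len(raw))          # 7 9
print(doubled == raw)                # True
print(re.search(plain, text))        # None: the text contains no backspace
print(re.search(raw, text).group())  # pixiv
```

So if you switch the whole pattern to a regular string literal, every \b (and any other escape that Python itself understands) has to be written as \\b.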

By the way, you might want to review your character range \u0800-\u9fa5, since it also matches characters from many unrelated blocks, such as Arabic Extended-A, Devanagari, Thai, Lao, Box Drawing and various symbol blocks.
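If the goal is only Japanese text plus ASCII word characters, a narrower character class might look like the sketch below. The choice of blocks (Hiragana, Katakana, CJK Unified Ideographs) is my assumption about what you want to match, not something stated in the question, and AUTHOR_CHARS and narrow_pattern are hypothetical names. It needs Python 3.3 or later.

```python
import re

# Assumed character set (not from the question): Hiragana, Katakana,
# CJK Unified Ideographs, plus ASCII letters, digits and underscore.
# These are regular string literals, so Python itself expands the \u escapes.
AUTHOR_CHARS = (
    '\u3040-\u309f'  # Hiragana
    '\u30a0-\u30ff'  # Katakana
    '\u4e00-\u9fff'  # CJK Unified Ideographs
    'a-zA-Z0-9_'
)

narrow_pattern = re.compile('[' + AUTHOR_CHARS + r']+(?=\s\[pixiv\])')
broad_pattern = re.compile(r'[\u0800-\u9fa5_a-zA-Z0-9_]+(?=\s\[pixiv\])')

title = 'ラララ | かるは [pixiv]'
box_drawing = '\u2500\u2500 [pixiv]'  # two BOX DRAWINGS LIGHT HORIZONTAL characters

print(narrow_pattern.search(title).group())      # かるは
print(broad_pattern.search(box_drawing).group()) # the box-drawing characters match
print(narrow_pattern.search(box_drawing))        # None
```

Whether this is the right set depends on which characters can actually appear in author names on the site, so treat the block list as a starting point.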

Explanation

Unicode escape sequences \u and \U in raw Unicode string

In Python 3, the Unicode escape sequences \u and \U are not treated specially in raw string literals, and this has been the specified behavior since Python 3.0. The string literal specification was updated in Python 3.3 to allow the u prefix again, for easier maintenance of code shared with Python 2, but that did not change how raw string literals are parsed:

# Python 3.4.3 (v3.4.3:9b73f1c3e601, Feb 24 2015, 22:43:06) [MSC v.1600 32 bit (Intel)] on win32
>>> r'[\u8000]'
'[\\u8000]'
>>> '[\u8000]'
'[耀]'

This is in contrast with Python 2, where Unicode escape sequences are converted into the corresponding Unicode characters even in raw Unicode string literals:

# Python 2.7.8 (default, Jul 25 2014, 14:04:36)
>>> print(u'\u8000')
耀
>>> print(ur'\u8000')
耀

Therefore, in Python 3 the regex engine receives the pattern from the question with the \u escapes still unprocessed:

>>> print(r'[\u0800-\u9fa5_a-zA-Z0-9_]+(?=\s\[pixiv\])')
[\u0800-\u9fa5_a-zA-Z0-9_]+(?=\s\[pixiv\])

Support for Unicode escape sequence \u and \U in re package

Before Python 3.3, the re package did not support the \u and \U Unicode escape sequences, as can be seen in the documentation for Python 3.2. As a result, \u and \U were interpreted as matching a literal u and U.

By adding the re.DEBUG flag, you can see the structure of the compiled regex. I have annotated part of the output for clarity:

# Python 3.2.5 (default, Jul 25 2014, 14:13:17)
>>> import re
>>> re.compile(r'[\u0800-\u9fa5_a-zA-Z0-9_]+(?=\s\[pixiv\])', re.DEBUG)
max_repeat 1 4294967295
  in
    literal 117      # u (\u)
    literal 48       # 0
    literal 56       # 8
    literal 48       # 0
    range (48, 117)  # 0-u (0-\u)
    literal 57       # 9
    literal 102      # f
    literal 97       # a
    literal 53       # 5
    literal 95
    range (97, 122)
    range (65, 90)
    range (48, 57)
    literal 95
assert 1
  in
    category category_space
  literal 91
  literal 112
  literal 105
  literal 120
  literal 105
  literal 118
  literal 93
<_sre.SRE_Pattern object at 0x600178850>

Python 3.3 finally added support for Unicode escape sequences to the re package, so the pattern works correctly on 3.3 and all later versions:

# Python 3.4.3 (v3.4.3:9b73f1c3e601, Feb 24 2015, 22:43:06) [MSC v.1600 32 bit (Intel)] on win32
>>> re.compile(r'[\u0800-\u9fa5_a-zA-Z0-9_]+(?=\s\[pixiv\])', re.DEBUG);
max_repeat 1 2147483647
  in
    range (2048, 40869) # \u0800-\u9fa5
    literal 95
    range (97, 122)
    range (65, 90)
    range (48, 57)
    literal 95
assert 1
  in
    category category_space
  literal 91
  literal 112
  literal 105
  literal 120
  literal 105
  literal 118
  literal 93
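Putting it together, here is a minimal sketch for Python 3.3 or later that runs the regex from the question against the sample tag and checks for None before calling group(), which avoids the AttributeError from the traceback. extract_author and AUTHOR_PATTERN are hypothetical names, not part of your code.

```python
import re

# The regex from the question, unchanged. On Python 3.3+ the re module
# itself interprets the \uXXXX escapes inside the raw string.
AUTHOR_PATTERN = re.compile(r'[\u0800-\u9fa5_a-zA-Z0-9_]+(?=\s\[pixiv\])')


def extract_author(page_source):
    """Return the author name found before ' [pixiv]', or None if absent."""
    match = AUTHOR_PATTERN.search(page_source)
    if match is None:
        # No match: report it to the caller instead of raising AttributeError.
        return None
    return match.group()


sample = '<meta property="og:title" content="ラララ | かるは [pixiv]">'
print(extract_author(sample))                          # かるは
print(extract_author('<title>no match here</title>'))  # None
```

The caller can then decide what to do when the title is missing, for example skipping the page or logging it.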
nhahtdh

The original code works correctly in Python 3.3 and later. In Python 2, however, the u string prefixes are required:

# -*- coding: utf-8 -*-
# Python 2 only: the ur'' prefix is a syntax error in Python 3.
import re

websiteCode = u'<meta property="og:title" content="ラララ | かるは [pixiv]">'
authorPattern = re.compile(ur'[\u0800-\u9fa5_a-zA-Z0-9_]+(?=\s\[pixiv\])')
tempAuthor = re.search(authorPattern, websiteCode)
print(u"temp: " + tempAuthor.group(0))
dlask