Dealing with "\n\t\t" with regex

Question

I have the following substring in the string str(dList):

"addressRegion">\n\t\t\t\t\t\t\t\t\tMA\n\t\t\t\t\t\t\t\t</span>

I am trying to use re.search to pull out "MA" using this:

state = re.search(r'"addressRegion">\n\t\t\t\t\t\t\t\t\t(.+?)\n\t',str(dList))

However, that doesn't seem to work. I understand this is possibly because of the way "\" is handled. I can't figure out how to deal with this.

You are using regex to parse HTML? [Please don't](http://stackoverflow.com/a/1732454/2308683) — OneCricketeer, Feb 11 '16 at 21:31
I am actually using BeautifulSoup, I am using regex for the finer details. (substrings). — krthkskmr, Feb 11 '16 at 21:33
Can you get the text within that span tag? Then strip the whitespace of `\t` and `\n`? — OneCricketeer, Feb 11 '16 at 21:34

score 2 · Answer 1 · answered Feb 11 '16 at 21:37

Regex is really not necessary

In [22]: html = '<span class="addressRegion">\n\t\t\t\t\t\t\t\t\tMA\n\t\t\t\t\t\t\t\t</span>'

In [23]: from bs4 import BeautifulSoup

In [24]: soup = BeautifulSoup(html, 'html.parser')

In [25]: soup.text
Out[25]: u'\n\t\t\t\t\t\t\t\t\tMA\n\t\t\t\t\t\t\t\t'

In [26]: soup.text.strip()
Out[26]: u'MA'

OneCricketeer

I agree, the no-regex solution is probably the better approach. – yurib Feb 11 '16 at 21:39
The output of soup.text is u'\\n\\t\\t\\t instead of u'\n\t\t\t, so soup.text.strip() doesn't do anything. But yes, I can see how this is a much better approach. – krthkskmr Feb 11 '16 at 21:59
@krthkskmr - Whatever you are viewing that in is escaping the backslashes – OneCricketeer Feb 11 '16 at 22:01
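
The escaping discussed in these comments can be reproduced with plain Python. This is only a sketch of one likely cause, assuming dList is a Python list that holds the markup as one of its items (the question does not show how dList is built). Calling str() on a list returns the repr of each item, so every real newline or tab becomes a two-character escape sequence, which neither the original pattern nor strip() will treat as whitespace.

```python
# Assumption: dList is a list holding the markup as one of its items.
markup = '<span class="addressRegion">\n\t\t\tMA\n\t\t</span>'
dList = [markup]

# str() of a list uses repr() for each item, so newlines and tabs
# are written out as backslash + letter instead of real whitespace.
as_text = str(dList)

print('\n' in markup)     # the item itself contains real newlines
print('\n' in as_text)    # the stringified list does not
print('\\n' in as_text)   # it contains a literal backslash followed by "n"
```

If that is what is happening, working on the individual item (for example dList[0], or the tag object itself) rather than on str(dList) avoids the problem entirely.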

score 1 · Answer 2 · answered Feb 11 '16 at 21:36

Update: This is how you could do it if you really wanted to use regex, but I think @cricket_007's solution is the better approach.

All you need to do is to escape the backslash with another backslash. You can also get rid of the repetitions of '\t':

>>> import re
>>> s = '"addressRegion">\n\t\t\t\t\t\t\t\t\tMA\n\t\t\t\t\t\t\t\t</span>'
>>> re.search('.*\\n(\\t)+(.*?)\\n(\\t)+.*',s).group(2)
'MA'
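
A variation on the pattern above, offered as a sketch rather than the answerer's own method: with a raw string and \s* the pattern does not depend on how many tabs surround the value. It assumes the string contains real newline and tab characters, as s does here, and that the value is followed by a closing </span>.

```python
import re

s = '"addressRegion">\n\t\t\t\t\t\t\t\t\tMA\n\t\t\t\t\t\t\t\t</span>'

# \s* absorbs any run of newlines and tabs on either side of the value,
# so the exact amount of indentation no longer matters.
match = re.search(r'"addressRegion">\s*(.+?)\s*</span>', s)

# re.search returns None when nothing matches, so guard before .group().
state = match.group(1) if match else None
print(state)
```

The None check matters when the same pattern is run over many records, since some of them may lack the span.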

yurib

I'll revoke my downvote for at least accepting regex is not the recommended approach. :) – OneCricketeer Feb 11 '16 at 21:41
