I want to extract two specific values, id and name, from a string. I think regular expressions are the right tool, but I can't work out the pattern.

The file contains:

<p>Any text, bla bla lorem ipsum, bla bla</p>
<p>test = {"player":{"id":"123123","name":"f_teste"};

Here is my progress:

import re

def main():
    padrao = r'"id"\w+'  # pattern attempt (not used yet)

    caminho = r'D:\index.txt'  # raw string keeps the backslash literal
    with open(caminho, 'r') as arquivo:
        # take the second line of the file and split it on '{'
        texto = arquivo.readlines()[1].split('{')

    textoEncontrado = texto[2].split(',')

    print textoEncontrado[0]
    print textoEncontrado[1]


if __name__ == '__main__':
    main()

Result:

"id":"123123"
"name":"f_teste"};

What I want:

id: 123123
name = f_teste

When I try to get only the id string using re, I get:

padrao = r'^id$'
(...)
result = re.findall(padrao, textoEncontrado[0])
print result
(...)

The result is []

Sorry for my bad English. Thanks, all. :)

Filipe Manuel
  • So... you're using a regex to parse JSON out of HTML. – Wug Aug 07 '12 at 19:40
  • @Wug Can it be considered a new level of [Force](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags)? – Lev Levitsky Aug 07 '12 at 19:45
  • The default will be always this. – Filipe Manuel Aug 07 '12 at 19:45
  • Use [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/) to parse HTML, and then the json module to parse what you've extracted from your HTML. I don't know how you ended up with this data structure, but that's your best bet. You should never use regex to parse things such as xml, html, json, but instead you should use the parsers already made available to you. No need to reinvent the wheel. – Lanaru Aug 07 '12 at 19:42
  • @LevLevitsky: I think I can feel zalgo coming out of my face. Also, my favorite answer to that question is this one: http://stackoverflow.com/a/5236278/1462604 – Wug Aug 07 '12 at 19:48

1 Answer

If your input is valid HTML that contains JSON text:

>>> from bs4 import BeautifulSoup
>>> html = """<p>Any text, bla bla lorem ipsum, bla bla</p>
... <p>test = {"player":{"id":"123123","name":"f_teste"}};"""
>>> soup = BeautifulSoup(html, 'html.parser')
>>> import re
>>> jsonre = re.compile(r'test\s*=\s*(.*);', re.DOTALL)
>>> p = soup('p', text=jsonre)[0]
>>> json_text = jsonre.search(p.get_text()).group(1)
>>> import json
>>> json.loads(json_text)
{u'player': {u'id': u'123123', u'name': u'f_teste'}}

To install bs4, run: pip install beautifulsoup4.
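Once the JSON is parsed, the two values the question asks for are plain dictionary lookups. Here is a minimal Python 3 sketch that uses only the standard library. It assumes the `test = {...};` assignment appears once in the text and that its braces are balanced, as in the session above. The helper name `extract_player` is made up for illustration.

```python
import json
import re

# Sample input copied from the session above (balanced braces assumed).
html = """<p>Any text, bla bla lorem ipsum, bla bla</p>
<p>test = {"player":{"id":"123123","name":"f_teste"}};"""


def extract_player(text):
    """Return the "player" dict from the first `test = {...};` assignment, or None."""
    # Capture everything from the first '{' to the last '}' before the ';'.
    match = re.search(r'test\s*=\s*(\{.*\})\s*;', text, re.DOTALL)
    if match is None:
        return None
    return json.loads(match.group(1))["player"]


player = extract_player(html)
print("id: {}".format(player["id"]))
print("name = {}".format(player["name"]))
```

If the real file can hold several such assignments, or text after the closing `;`, the greedy `.*` would need tightening.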

A regex solution would look like:

>>> re.findall(r'"(id)":"([^"]*)","(name)":"([^"]*)"', html)
[('id', '123123', 'name', 'f_teste')]
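The `^id$` attempt in the question returns `[]` because `^` and `$` anchor the pattern to the start and end of the string, so it matches only a string that is exactly `id`. The sketch below (Python 3) shows the difference. The variable names are illustrative, and the named groups are just one way to label the captured values.

```python
import re

fragment = '"id":"123123"'

# Anchored pattern: matches only when the whole string is exactly "id".
print(re.findall(r'^id$', fragment))  # []

# Unanchored pattern with a capture group: findall returns the captured value.
print(re.findall(r'"id":"([^"]*)"', fragment))  # ['123123']

# Named groups give the captured values readable keys.
pattern = re.compile(r'"id":"(?P<id>[^"]*)","name":"(?P<name>[^"]*)"')
m = pattern.search('test = {"player":{"id":"123123","name":"f_teste"}};')
fields = m.groupdict()
print("id: {id}".format(**fields))
print("name = {name}".format(**fields))
```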
jfs