
I am stuck on regex syntax. I am trying to write a regex for HTML code that finds a specific string inside a table cell and returns the value of the next column, i.e. the cell right after the search string.

[u'<table> <tr> <td>Ingatlan \xe1llapota</td> <td>fel\xfaj\xedtott</td> </tr> <tr> <td>\xc9p\xedt\xe9s \xe9ve</td> <td>2018</td> </tr> <tr> <td>Komfort</td> <td>luxus</td> </tr> <tr> <td>Energiatan\xfas\xedtv\xe1ny</td> <td class="is-empty">nincs megadva</td> </tr> <tr> <td>Emelet</td> <td>1</td> </tr> <tr> <td>\xc9p\xfclet szintjei</td> <td class="is-empty">nincs megadva</td> </tr> <tr> <td>Lift</td> <td>van</td> </tr> <tr> <td>Belmagass\xe1g</td> <td>3 m vagy magasabb</td> </tr> <tr> <td>F\u0171t\xe9s</td> <td>g\xe1z (cirko)</td> </tr> <tr> <td>L\xe9gkondicion\xe1l\xf3</td> <td>van</td> </tr> </table>', u'<table> <tr> <td>Akad\xe1lymentes\xedtett</td> <td>nem</td> </tr> <tr> <td>F\xfcrd\u0151 \xe9s WC</td> <td>k\xfcl\xf6n \xe9s atlan \xe1llapota')

So I would like a regex that looks for "Ingatlan \xe1llapota" and returns "fel\xfaj\xedtott", the value in the cell next to it.

My current regex is the following: \bIngatlan állapota\s+(.*) I still need to incorporate the td tags and to limit how much of the string is returned after the search string (Ingatlan állapota).
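A minimal sketch of what such a pattern might look like, assuming the label and its value always sit in two adjacent td cells with only whitespace between them. The helper name next_cell and the shortened sample HTML are made up for illustration; the non-greedy (.*?) stops the capture at the first closing </td>, which is what limits the length of the returned string. A pattern like this is tied to this exact markup and will likely break on nested tags or other layouts.

```python
import re

# Shortened copy of the table from above (three rows only).
html = ('<table> <tr> <td>Ingatlan \xe1llapota</td> <td>fel\xfaj\xedtott</td> </tr> '
        '<tr> <td>\xc9p\xedt\xe9s \xe9ve</td> <td>2018</td> </tr> '
        '<tr> <td>Energiatan\xfas\xedtv\xe1ny</td> '
        '<td class="is-empty">nincs megadva</td> </tr> </table>')


def next_cell(label, html):
    """Return the text of the <td> right after the <td> holding `label`, or None."""
    pattern = (
        r'<td[^>]*>\s*' + re.escape(label) + r'\s*</td>'  # the label cell
        r'\s*<td[^>]*>(.*?)</td>'  # the following cell, captured non-greedily
    )
    match = re.search(pattern, html, flags=re.DOTALL)
    return match.group(1).strip() if match else None


print(next_cell('Ingatlan \xe1llapota', html))
print(next_cell('Lift', html))  # label not present in the shortened sample
```

re.escape keeps characters in the label from being read as regex syntax, and [^>]* lets the pattern tolerate attributes such as class="is-empty" on the td tags.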

Any help is much appreciated. Thanks!

adr
  • Possible duplicate of [Can you provide some examples of why it is hard to parse XML and HTML with a regex?](https://stackoverflow.com/questions/701166/can-you-provide-some-examples-of-why-it-is-hard-to-parse-xml-and-html-with-a-reg) – hellow Jul 30 '18 at 09:27
  • In short: don't! Use a proper parser for this job. – hellow Jul 30 '18 at 09:28
  • I think [Tony the Pony](https://stackoverflow.com/a/1732454/104349) wants a word with you... – Daniel Roseman Jul 30 '18 at 09:28

1 Answer


As pointed out in the comments, use XPath or CSS selectors instead of a regex:

from scrapy.http import TextResponse

# Label to look for (the text of the first cell in the row).
sterm = 'Ingatlan \xe1llapota'
# HTML from the question.
txt = '''<table> <tr> <td>Ingatlan \xe1llapota</td> <td>fel\xfaj\xedtott</td> </tr> <tr> <td>\xc9p\xedt\xe9s \xe9ve</td> <td>2018</td> </tr> <tr> <td>Komfort</td> <td>luxus</td> </tr> <tr> <td>Energiatan\xfas\xedtv\xe1ny</td> <td class="is-empty">nincs megadva</td> </tr> <tr> <td>Emelet</td> <td>1</td> </tr> <tr> <td>\xc9p\xfclet szintjei</td> <td class="is-empty">nincs megadva</td> </tr> <tr> <td>Lift</td> <td>van</td> </tr> <tr> <td>Belmagass\xe1g</td> <td>3 m vagy magasabb</td> </tr> <tr> <td>F\u0171t\xe9s</td> <td>g\xe1z (cirko)</td> </tr> <tr> <td>L\xe9gkondicion\xe1l\xf3</td> <td>van</td> </tr> </table> <table> <tr> <td>Akad\xe1lymentes\xedtett</td> <td>nem</td> </tr> <tr> <td>F\xfcrd\u0151 \xe9s WC</td> <td>k\xfcl\xf6n \xe9s atlan </td></tr></table>'''
# Wrap the HTML in a response object so selectors work; the URL is only a placeholder.
resp = TextResponse(url='abc', body=txt, encoding='utf-8')
# Find the <td> whose text equals the label and take the text of the next <td>.
print(resp.xpath('.//td[.="' + sterm + '"]/following-sibling::td[1]/text()').extract())

Result:

$ python3 so_51590811.py 
['felújított']
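If the set of rows varies from page to page, one option is to read every row into a dictionary keyed by its label and look up only the labels that are present. The sketch below swaps in Python's standard-library html.parser in place of Scrapy so that it runs without extra packages; the names TableRowParser and table_to_dict are made up for illustration, and it assumes a flat table in which each row holds a label cell followed by a value cell.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collects the text of each <td>, grouped by <tr> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []          # one list of cell strings per <tr>
        self._in_cell = False   # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def table_to_dict(html):
    """Map the first cell of each row to its second cell; shorter rows are skipped."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return {row[0].strip(): row[1].strip() for row in parser.rows if len(row) >= 2}


# Shortened sample of the table from the question.
html = ('<table> <tr> <td>Ingatlan \xe1llapota</td> <td>fel\xfaj\xedtott</td> </tr> '
        '<tr> <td>Komfort</td> <td>luxus</td> </tr> '
        '<tr> <td>Energiatan\xfas\xedtv\xe1ny</td> '
        '<td class="is-empty">nincs megadva</td> </tr> </table>')
details = table_to_dict(html)
print(details.get('Ingatlan \xe1llapota'))
print(details.get('Lift'))  # missing rows simply yield None
```

Because the lookup goes by label text and not by row position, a row that is absent on a given page just produces None and does not shift the other results.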
Thomas Strub
  • Thanks for the help! Unfortunately I cannot use XPath or CSS, because the table I am parsing is filled dynamically depending on how the details page was filled out. If one of the fields is not filled in, it is not displayed, and a different kind of field takes its place in that table cell. – adr Jul 31 '18 at 11:44