My code works, and I have no problem extracting what I need. My problem is that I get one result when I run the regex on the live response of the web service, and a different result when I run the same regex on that same response saved in a variable. I have been blocked on this for days, and I hope you can help.

NOTE: the answers to the suggested duplicate questions don't work for me; this isn't a duplicate question.

I'm consuming a web service. The response is stored in the variable answerService; it is a very long string, and from it I extract the contents of every span tag that has this structure:

<span style="font-weight:bold"> xxx </span>

"xxx" is what I want to extract:

 # with this I get the "xxx"
 arraySpan = re.findall(r'<span style="font-weight:bold">(.*?)<', answerService)

I get an array of length n, with one entry for each span that has this structure.

If I do this directly on the response from the web service, it does not work and I only get this result:

['áGILMENTE']

Now, if I paste the response of the web service into my code as sameStringOfAnswer, the result is different:

print(arraySpan)
['ADV', 'áGILMENTE']

Logically the response is the same and never changes, but for some strange reason, when I run this on the live response from the web service, I only get ['áGILMENTE'] when I expect ['ADV', 'áGILMENTE'].

The key point is that the response always contains two spans with the structure I need.

Here is my code:

import re
import requests

session = requests.Session()

# The first request sets the CGISESSID cookie, which is reused below as the csrf value
session.get('http://cartago.lllf.uam.es/grampal/grampal.cgi')
cookie = session.cookies.get_dict()
getId = cookie["CGISESSID"]

# Call the web service with that ID
getService = requests.get("http://cartago.lllf.uam.es/grampal/grampal.cgi?m=analiza&csrf=" + getId + "&e=" + "ágilmente", cookies=cookie)

answerService = getService.text
# Get the contents of each <span>
arraySpan = re.findall(r'<span style="font-weight:bold">(.*?)<', answerService)
print(answerService)
print("array", arraySpan)

# Same code, but run on the web service response pasted in as a string
sameStringOfAnswer='<html xmlns="http://www.w3.org/TR/REC-html40"><head><title>Grampal </title><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"><meta name="Content-Language" content="EN"><meta name="author" content="jmguirao@ugr.es"><link rel="icon" type="image/ico" href="/favicon.ico"/><style type="text/css">html,body,form,ul,li,h1,h3,p{margin:0; padding:0}body{font-family: Arial, Helvetica, sans-serif; background-color:#fff}a{text-decoration: none;}a:hover{text-decoration: underline}ul{list-style-type: none}td{padding: 0.5pc 2pc 0pc 0pc}.nav{float: right; padding: 0.5pc 0.5pc 0.5pc 0.5pc; margin-left:5px}.nav li{display:inline; border-left: 1px solid #444; padding:0 0.4em;}.nav li.first{border-left:0}.hide{display:none}input{text-indent: 2px}input[type="submit"]{text-indent: 0}DIV.delPage{padding: 0.5ex 5em 0.5em 5em; background-color:#ffd6ba;}.delMain{padding: 2ex 0.5em 0.5pc 0.5em;}.post{margin-bottom: 0.25pc; font-size: 100%; padding-top: 0.5ex;}.posts, #posts{padding: 0.5ex 0.5em 0.5pc 50px;}.banner{padding: 0.5ex 0 0.5pc 0.5em;background-color: #ffc6aa;clear: both}.banner h1{font-weight: bolder; font-size: 150%;margin:0; padding:0 0 0 26px; display: inline;}h2{font-weight: bolder; font-size: 140%; color: red; margin:0; padding:0 0 0 26px; display: inline;}.resaltado{font-weight: bolder;font-size: 100%}</style></head><body><div class="banner"><ul class="hide"><li><a href="#content">skip to content</a></li></ul><ul class="nav">Análsis de:<li class="first"><a title="Analizador morfosintáctico" href="/grampal/grampal.cgi?m=analiza&e=ágilmente">palabras</a></li><li><a title="Desambiguador contextual" href="/grampal/grampal.cgi?m=etiqueta&e=ágilmente">oraciones</a></li><li><a title="Etiquetado de textos" href="/grampal/grampal.cgi?m=xml">textos</a></li><li><a title="Formas de una palabra" href="/grampal/grampal.cgi?m=genera&e=ágilmente">Generación de formas</a></li><!--<li><a title="Transcripción fonética" 
href="/grampal/grampal.cgi?m=transcribe&e=ágilmente">Transcripción</a></li>--><li><a href="/grampal/grampal.cgi?m=etiquetario">Etiquetario</a></li><li><a href="/grampal/grampal.cgi?m=autores">Autores</a></li></ul><h1>Grampal</h1></div><div class="delPage" style="font-size: 80%;"><form method="GET" action="/grampal/grampal.cgi"><input type="hidden" name="m" value="analiza"><input type="hidden" name="csrf" value="94508700a0ae409a90718299ae00b0e0"><span class="resaltado">Palabra : </span><input name="e" size="60" value="ágilmente"><input type="submit" value="Analiza"> &nbsp;</form></div><br><h2>ágilmente</h2><div class="delMain"><div id="posts"><table><tr><td style="font-style:italic;font-size:90%">categoría&nbsp;<span style="font-weight:bold"> ADV </span></td><td style="font-style:italic;font-size:90%">lema&nbsp;<span style="font-weight:bold"> áGILMENTE </span></td></tr></table></div></div></body></html>'
arraySpan = re.findall(r'<span style="font-weight:bold">(.*?)<', sameStringOfAnswer)
print(arraySpan)

What am I doing wrong?

Malekai
unusuario
  • Why are you using regex to parse html? – Maximilian Burszley Mar 27 '19 at 14:52
  • @TheIncorrigible1 I'm new to Python; maybe this is bad practice, but it's the way I found to extract what I need. – unusuario Mar 27 '19 at 14:53
  • @TheIncorrigible1 Please do not mark my question as resolved. Regardless of whether this is bad practice, my code is functional, and the problem I have could also occur if it were done differently. Please look at my problem; it's kind of weird. – unusuario Mar 27 '19 at 14:57
  • Possible duplicate of [RegEx match open tags except XHTML self-contained tags](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Ralf Mar 27 '19 at 14:59
  • @Ralf It is not a duplicate; please do not mark my question as a duplicate. My code works, and I have no problem extracting what I need. My problem is that I get one result from the live response of the web service and a different result from the same response saved in a variable. I have been blocked on this for days, and I hope you can help. – unusuario Mar 27 '19 at 15:02
  • @unusuario if I try `answerService == sameStringOfAnswer` it tells me that the strings are not actually equal, so that is why there are different results of `.findall()` – Ralf Mar 27 '19 at 15:06
  • @Ralf The only thing I did was take the response from the web service and run it through an online tool to minify the text (to put everything on one line). I did this to compare the results, but I did not omit any characters; it is practically the same response as the web service's, with the same structure. – unusuario Mar 27 '19 at 15:13
  • @Ralf I had to minify the response because the text is very long and contains a lot of whitespace, and I did not know how to put a string like that in Python. That is why I minified the text. – unusuario Mar 27 '19 at 15:15
  • The minifier causes your difference. Look at my answer – Ralf Mar 27 '19 at 15:22

1 Answer

The HTML from the webservice contains:

<span style="font-weight:bold"> ADV\n </span>

But your minified code contains the tag without the newline \n:

<span style="font-weight:bold"> ADV </span>
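Two strings that look identical when printed can still differ in invisible characters. A minimal sketch of how you could locate such a difference: `raw` and `minified` are hypothetical stand-ins for the live response and the pasted string, and `first_difference` is a helper written only for this illustration.

```python
# Hypothetical stand-ins for the live response and the minified copy
raw = '<span style="font-weight:bold"> ADV\n </span>'
minified = '<span style="font-weight:bold"> ADV </span>'


def first_difference(a, b):
    """Return the index of the first position where a and b differ, or -1 if equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # No mismatch in the common prefix: either equal, or one string is longer
    return -1 if len(a) == len(b) else min(len(a), len(b))


idx = first_difference(raw, minified)
print(raw == minified)  # False

# repr() shows escape sequences such as \n instead of rendering them
print(repr(raw[idx - 3:idx + 3]))
print(repr(minified[idx - 3:idx + 3]))
```

Applied to the real data, the same check would be `first_difference(answerService, sameStringOfAnswer)`.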

You can test the difference yourself:

>>> import re
>>> pattern = r'<span style="font-weight:bold">(.*?)<'
>>> re.findall(pattern, '<span style="font-weight:bold">AAA\n<')
[]
>>> re.findall(pattern, '<span style="font-weight:bold">AAA<')
['AAA']

That is why the results are different. You should have mentioned that you used a minifier: minifiers alter the HTML, so you cannot run the same regex on the minified text and expect the same output.
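If you want to stay with regex for now, one possible workaround is the `re.DOTALL` flag, which makes `.` match newlines as well. This is only a sketch: the `html` sample below is modeled on the fragment of the response shown above, and the real markup may contain other variations.

```python
import re

# Sample modeled on the web service response (note the newline inside the first span)
html = (
    '<td>categoría&nbsp;<span style="font-weight:bold"> ADV\n </span></td>'
    '<td>lema&nbsp;<span style="font-weight:bold"> áGILMENTE </span></td>'
)

# re.DOTALL lets "." match newline characters too
pattern = re.compile(r'<span style="font-weight:bold">(.*?)<', re.DOTALL)

# strip() removes the surrounding spaces and the newline from each match
values = [match.strip() for match in pattern.findall(html)]
print(values)  # ['ADV', 'áGILMENTE']
```

This still depends on the exact spelling of the `style` attribute, so it remains fragile if the markup changes.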

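This whole problem would have been avoided if you had used an HTML parser instead of regex, just as the linked question suggests: [RegEx match open tags except XHTML self-contained tags](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags)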
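As a rough illustration of the parser approach, here is a sketch that uses `html.parser` from Python's standard library. The class name `BoldSpanParser` and the sample `html` string are made up for this example, and it assumes the bold spans are not nested inside each other.

```python
from html.parser import HTMLParser


class BoldSpanParser(HTMLParser):
    """Collect the text of every <span style="font-weight:bold">."""

    def __init__(self):
        super().__init__()
        self.values = []      # stripped text of each matching span, in order
        self._inside = False  # True while inside a matching span
        self._parts = []      # text chunks of the span being read

    def handle_starttag(self, tag, attrs):
        style = dict(attrs).get("style") or ""
        # Ignore spaces so "font-weight: bold" is accepted as well
        if tag == "span" and style.replace(" ", "") == "font-weight:bold":
            self._inside = True
            self._parts = []

    def handle_endtag(self, tag):
        if tag == "span" and self._inside:
            self.values.append("".join(self._parts).strip())
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self._parts.append(data)


# Sample modeled on the web service response, including the newline
html = (
    '<span class="resaltado">Palabra : </span>'
    '<table><tr><td>categoría&nbsp;<span style="font-weight:bold"> ADV\n </span></td>'
    '<td>lema&nbsp;<span style="font-weight:bold"> áGILMENTE </span></td></tr></table>'
)

parser = BoldSpanParser()
parser.feed(html)
parser.close()
print(parser.values)  # ['ADV', 'áGILMENTE']
```

Because the parser works on tags and text nodes instead of raw characters, whitespace and newlines inside the span no longer change which spans are found.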

Ralf
  • You are a genius, I think I finally understand my problem. Although in theory I am getting everything that is inside the <span>, what is the best way to get what I need from inside those tags? – unusuario Mar 27 '19 at 15:30
  • The answers in [this question](https://stackoverflow.com/questions/33312175/matching-any-character-including-newlines-in-a-python-regex-subexpression-not-g) suggest using `([\s\S]*?)` (or some variation of it) instead of `(.*?)`. – Ralf Mar 27 '19 at 15:41
  • @unusuario you should read more about regex to get a good solution for your use case. – Ralf Mar 27 '19 at 15:41
  • You should really *really* use a parser. Try BeautifulSoup. Here's some code that does what you want to get you started. https://gist.github.com/akent/86dd72a085d452e8db5f4d76c3cce2c9 – akent Mar 27 '19 at 15:46