I have the following inputs and desired outputs that I wish to replace in an HTML document, possibly using regular expressions or string replacement.

Given:
input: '<b>º </b>' 
output: ['º']

input: '<b>Nº </b>' 
output: []

input: '<b>1º </b>' 
output: []

input: '<b>1ª </b>' 
output: []

input: '<p>N<u>º </u></p>' 
output: ['º']

Attempt

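import re
from bs4 import BeautifulSoup

l = [('<b>º </b>', ['º']), ('<b>Nº </b>', [])]

for html, expected in l:
    # flags must be passed by keyword: the fourth positional argument of re.sub is count
    codigo = re.sub(r'<(b|sup|s|u)>\s*[oº]\s*</(b|sup|s|u)>', 'º ', html, flags=re.I)
    soup = BeautifulSoup(codigo, 'html.parser')
    result = soup.find_all('b', string='º')
    # compare the matched text against this pair's expected output, not l[1]
    assert [tag.get_text() for tag in result] == expected, "ops.."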

How do I solve this problem?

Emma
britodfbr

1 Answer

I would try this: first, add your inputs to a list:

from bs4 import BeautifulSoup as bs

codi = ['<b>º </b>', '<b>Nº </b>', '<b>1º </b>', '<b>1ª </b>', '<p>N<u>º </u></p>']

Then process the list with BS:

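for i in codi:
    soup = bs(i, 'html.parser')
    print('input:', i)
    # Select every tag whose text contains 'º'. Newer soupsieve releases
    # deprecate ':contains' in favour of ':-soup-contains'.
    targets = soup.select('*:contains(º)')
    for target in targets:
        # Keep only tags whose whole text, once stripped, is exactly 'º'.
        if target.text.strip() == 'º':
            print('output:', target.text.strip())
    print('--------------')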

Output:

input: <b>º </b>
output: º
--------------
input: <b>Nº </b>
--------------
input: <b>1º </b>
--------------
input: <b>1ª </b>
--------------
input: <p>N<u>º </u></p>
output: º
--------------
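
If you need the results as lists, as in the question, rather than printed lines, the same check can be wrapped in a function. This is only a minimal sketch: the name `find_lone_ordinals` and its `symbol` parameter are made up for illustration, and it walks every tag with `find_all(True)` instead of a CSS selector, so it does not depend on any particular soupsieve pseudo-class.

```python
from bs4 import BeautifulSoup


def find_lone_ordinals(html, symbol="º"):
    """Return the stripped text of every tag whose whole text is `symbol`.

    Hypothetical helper: the name and the `symbol` parameter are
    illustrative and do not come from the original post.
    """
    soup = BeautifulSoup(html, "html.parser")
    # find_all(True) yields every tag in the parsed fragment.
    return [
        tag.get_text().strip()
        for tag in soup.find_all(True)
        if tag.get_text().strip() == symbol
    ]


inputs = ['<b>º </b>', '<b>Nº </b>', '<b>1º </b>', '<b>1ª </b>', '<p>N<u>º </u></p>']
for html in inputs:
    print(repr(html), '->', find_lone_ordinals(html))
```

For the five inputs from the question this gives ['º'], [], [], [] and ['º']. Note that a lone symbol nested in several tags, such as '<p><u>º </u></p>', would be reported once per enclosing tag, which is also how the loop above behaves.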

Credit for the approach: numerous answers from @QHarr - the king of soup.select().

Jack Fleeting
  • @Jack Fleeting Is this implementation faster than the regex approach? – britodfbr May 12 '19 at 11:39
  • @britodfbr - I don't know if it's faster (I haven't tested it), but I personally dislike regex, and if you google around you'll see that experts discourage using regex on HTML. So I generally try to avoid it at all costs :) – Jack Fleeting May 12 '19 at 11:53