I want to extract myinformation_1 and myinformation_2 from the list below.

My code is not working yet.

Can you please help?

Thank you, HHC

import re
start = re.escape(">")
end = re.escape("<")
stringlist = ['<div class="ant-space-item"><a href="/holdings-of-1">myinformation_1</a></div>',
    '<div class="ant-space-item"><a href="/holdings-of-2avbf">myinformation_2</a></div>']
for i in stringlist:
    result = re.search('%s(.*)%s' % (start, end), i).group(1)
    print(result)
accdias
Hary2

1 Answer

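Your pattern >(.*)< is greedy: .* runs from the first > to the last < in the string, so group 1 ends up holding the whole <a ...>...</a> element instead of just its text. Try a more specific regex, e.g. <a href="/holdings-of-[^"]+">([^<]*) in this case: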

import re

stringlist = ['<div class="ant-space-item"><a href="/holdings-of-1">myinformation_1</a></div>',
    '<div class="ant-space-item"><a href="/holdings-of-2adf">myinformation_2</a></div>']

for i in stringlist:
    # Group 1 is the link text: everything after the opening <a ...> tag up to the next "<".
    result = re.search(r'<a href="/holdings-of-[^"]+">([^<]*)', i).group(1)
    print(result)

Output:

myinformation_1
myinformation_2
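
To see why the pattern in the question fails, here is a small sketch that runs the greedy >(.*)< next to a variant using a negated character class. The variable names greedy and tight are only for illustration, and the second pattern is one possible fix, not the only one.

```python
import re

s = '<div class="ant-space-item"><a href="/holdings-of-1">myinformation_1</a></div>'

# Greedy: .* starts after the first ">" and extends to the last "<" in the string,
# so the capture swallows the whole <a> element.
greedy = re.search(r'>(.*)<', s).group(1)
print(greedy)  # <a href="/holdings-of-1">myinformation_1</a>

# Negated class: [^<>]+ cannot cross a tag boundary, so only the text
# between two adjacent tags can match.
tight = re.search(r'>([^<>]+)<', s).group(1)
print(tight)  # myinformation_1
```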

Or, as suggested in the comments, you can use a more general expression that works for any <a> tag regardless of its href, such as <a.*?>([^<]*).
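
A minimal sketch of that generalized pattern, applied to the same list. The helper name link_text is hypothetical, and returning None when nothing matches is an assumption added here so that a string without a link does not raise AttributeError on .group(1).

```python
import re

stringlist = ['<div class="ant-space-item"><a href="/holdings-of-1">myinformation_1</a></div>',
    '<div class="ant-space-item"><a href="/holdings-of-2adf">myinformation_2</a></div>']

def link_text(html):
    # Hypothetical helper: return the text right after the first <a ...> tag, or None.
    # .*? is lazy, so it stops at the first ">" that closes the opening tag.
    match = re.search(r'<a.*?>([^<]*)', html)
    return match.group(1) if match else None

results = [link_text(s) for s in stringlist]
print(results)  # ['myinformation_1', 'myinformation_2']
```

Note that <a.*?> also matches other tags whose names start with "a", so this is only suitable when the input is known to contain plain links like the ones above.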

rv.kvetch