I'm trying to find and print a specific piece of text using the findall function, but I can't get it to work. My plan is to first capture a larger section of the document and store it in a variable, then run findall again on that section to get exactly what I want. I have to do it in two steps because if I search directly for the src, I also get junk from other areas of the document.

This is what I've done so far.

from re import findall

# Locate the section of text containing the img source
html_img_source_and_junk = findall(r'</noscript>\s+<img\s+src="([^"]+)"\s+alt', html_source_whittakers)
print(html_img_source_and_junk)

This is the text I'm trying to extract the information from.

</noscript>

<img

  src="//cdn.shopify.com/s/files/1/0274/7315/products/whi_225x225.jpg?v=1525431190"

alt="
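
For reference, a minimal sketch of the two-step idea described above. It assumes a small stand-in string for `html_source_whittakers` modelled on the snippet shown here; the alt text is a made-up placeholder because the original is cut off. The names `blocks`, `image_sources` and `one_step` are hypothetical. It first captures the whole img tag that follows the closing noscript tag, then pulls the src out of that captured text.

```python
import re

# Hypothetical stand-in for html_source_whittakers, modelled on the snippet above.
# The alt value is an invented placeholder; the real one is not shown.
html_source_whittakers = """
</noscript>

<img

  src="//cdn.shopify.com/s/files/1/0274/7315/products/whi_225x225.jpg?v=1525431190"

alt="placeholder">
"""

# Step 1: capture each <img ...> tag that directly follows a </noscript> tag.
# [^>]* also matches newlines, so the tag may span several lines.
blocks = re.findall(r'</noscript>\s*(<img\b[^>]*>)', html_source_whittakers)

# Step 2: extract the src attribute from each captured tag.
image_sources = []
for block in blocks:
    image_sources.extend(re.findall(r'\bsrc="([^"]+)"', block))

# For comparison: the single-step pattern from the question, written as a raw string.
one_step = re.findall(r'</noscript>\s+<img\s+src="([^"]+)"\s+alt', html_source_whittakers)

print(image_sources)
print(one_step)
```

On this stand-in text both approaches return the same single URL, so if the real page gives a different result, the difference is probably in the surrounding markup rather than in findall itself.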
Jason
  • It's unclear whether what's not working for you is the `regex` or `findall`. Have you looked at this https://stackoverflow.com/questions/7752551/python-regex-findall? Maybe your regex also matches in other places? Can you test it on regex101.com? – sal May 18 '20 at 01:53
  • I ended up solving it. I realised that the sections of the HTML files I wanted all had the same sequence, so I simply did this: `image_matches_whittakers = findall('"noscript"[\s]+srcset="//(cdn.[^"]+)[\s]+1x', html_source_whittakers)` followed by `print(image_matches_whittakers)`. Note that I did end up changing the code I wanted to extract. – Jason May 18 '20 at 03:01
