Please forgive any mistakes, and leave a comment if anything is unclear.
I am trying to scrape headlines from various blogs with a regex. Each headline sits in either an h2 or a bold (b) tag and starts with a number. With the regular expression below, I get only the first word of each headline instead of the full headline:
response.css('h2::text').re(r'\d+\.\s*\w+')
I don't know where I am going wrong. The desired output is:
[1. Golgappa at Chawla's and Nand's, 2. Pyaaz Kachori at Rawat Mishthan Bhandar, 2. Pyaaz Kachori at Rawat Mishthan Bhandar, 4. Best of Indian Street Food at Masala Chowk, ... and so on]
and
[1. Keema Baati, 2. Pyaaz Kachori, 3. Dal Baati Churma, ... and so on]
What I am getting instead is:
2021-08-17 05:55:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.holidify.com/robots.txt> (referer: None)
2021-08-17 05:55:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.holidify.com/pages/street-food-in-jaipur-1483.html> (referer: None)
['1. Golgappa', '2. Pyaaz', '3. Masala', '4. Best', '5. Kaathi', '6. Pav', '7. Omelette', '8. Chicken', '9. Lassi', '10. Shrikhand', '11. Kulfi', '12. Sweets', '13. Fast', '14. Cold']
2021-08-17 05:55:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.adequatetravel.com/blog/famous-foods-in-jaipur-you-must-try/> (referer: None)
['1. Keema', '2. Pyaaz', '3. Dal', '4. Shrikhand', '5. Ghewar', '6. Mawa', '7. Mirchi', '8. Gatte', '9. Rajasthani', '10. Laal']
2021-08-17 05:55:33 [scrapy.core.engine] INFO: Closing spider (finished)
If you could suggest a regex, that would be a great help.
These are the sites I am scraping, in case you want to visit them:
https://www.adequatetravel.com/blog/famous-foods-in-jaipur-you-must-try/ and https://www.holidify.com/pages/street-food-in-jaipur-1483.html
Here is my code, in case you want to see it:
import scrapy
import re


class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['www.tasteatlas.com', 'www.lih.travel', 'www.crazymasalafood.com', 'www.holidify.com', 'www.jaipurcityblog.com', 'www.trip101.com', 'www.adequatetravel.com']
    start_urls = [
        'https://www.adequatetravel.com/blog/famous-foods-in-jaipur-you-must-try/',
        'https://www.holidify.com/pages/street-food-in-jaipur-1483.html',
    ]

    def parse(self, response):
        # Try the h2 headlines first, then fall back to bold tags.
        if response.css('h2::text').re(r'\d+\.\s*\w+'):
            print(response.css('h2::text').re(r'\d+\.\s*\w+'))
        elif response.css('b::text').re(r'\d+\.\s*\w+'):
            print(response.css('b::text').re(r'\d+\.\s*\w+'))
From `h2` and `b` on the above URLs, you get lots of unwanted data. Please look at their source code.
– Ram Aug 17 '21 at 07:46
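A possible explanation, offered as a sketch rather than a tested fix for these sites: `\w+` matches a single run of word characters, so the match stops at the first space after the first word. A pattern that consumes the rest of the string, anchored so that strings not starting with a number are skipped, might look like the following. The `HEADLINE_RE` and `extract_headlines` names and the `sample` list are hypothetical, standing in for whatever `response.css('h2::text').getall()` returns.

```python
import re

# A number, a dot, optional spaces, then everything up to the end of the
# string, with surrounding whitespace left out of the captured group.
HEADLINE_RE = re.compile(r'^\s*(\d+\.\s*\S.*?)\s*$')


def extract_headlines(texts):
    """Return the strings that look like numbered headlines, trimmed."""
    headlines = []
    for text in texts:
        match = HEADLINE_RE.match(text)
        if match:
            headlines.append(match.group(1))
    return headlines


# Hypothetical sample; the real pages may contain different strings.
sample = [
    "1. Golgappa at Chawla's and Nand's",
    "  2. Pyaaz Kachori at Rawat Mishthan Bhandar ",
    "Related posts",
    "4. Best of Indian Street Food at Masala Chowk",
]

# The original pattern stops after the first word.
first_word_only = re.findall(r'\d+\.\s*\w+', sample[0])
print(first_word_only)

# The anchored pattern keeps the full headline and drops unrelated text.
print(extract_headlines(sample))
```

In the spider, the same idea could presumably be written as `response.css('h2::text').re(r'\d+\.\s*\S.*')`, since the selector's `.re()` method applies the pattern to each extracted string. Whether that is enough to filter out the unwanted data mentioned in the comment depends on the actual markup of each page.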