
Please forgive my mistakes, and add a comment for anything that is unclear.

I am trying to scrape data from various blogs that is contained in either `h2` or `b` tags and starts with a number. With the regular expression below I am only getting the starting word of each headline instead of the full headline:

 response.css('h2::text').re(r'\d+\.\s*\w+')

I don't know where I am going wrong. The expected output should be like:

 [1. Golgappa at Chawla's and Nand's, 2. Pyaaz Kachori at Rawat Mishthan Bhandar, 3. Masala Chai at Gulab Ji Chaiwala, 4. Best of Indian Street Food at Masala Chowk, ... and so on]
 and [1. Keema Baati, 2. Pyaaz Kachori, 3. Dal Baati Churma, ... and so on]

and what I am getting is:

2021-08-17 05:55:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.holidify.com/robots.txt> (referer: None)
2021-08-17 05:55:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.holidify.com/pages/street-food-in-jaipur-1483.html> (referer: None)
['1. Golgappa', '2. Pyaaz', '3. Masala', '4. Best', '5. Kaathi', '6. Pav', '7. Omelette', '8. Chicken', '9. Lassi', '10. Shrikhand', '11. Kulfi', '12. Sweets', '13. Fast', '14. Cold']
2021-08-17 05:55:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.adequatetravel.com/blog/famous-foods-in-jaipur-you-must-try/> (referer: None)
['1. Keema', '2. Pyaaz', '3. Dal', '4. Shrikhand', '5. Ghewar', '6. Mawa', '7. Mirchi', '8. Gatte', '9. Rajasthani', '10. Laal']
2021-08-17 05:55:33 [scrapy.core.engine] INFO: Closing spider (finished)
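
A standalone check with plain `re` (the sample line is taken from the expected output above) reproduces the truncation outside Scrapy, so the problem is the pattern itself:

import re

# \w+ matches word characters only, so the match stops at the first space
line = "1. Golgappa at Chawla's and Nand's"
print(re.findall(r'\d+\.\s*\w+', line))  # ['1. Golgappa']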

It would be a great help if you could suggest a regex that captures the full headlines.

If you want to visit the sites, these are the ones I am scraping:

https://www.adequatetravel.com/blog/famous-foods-in-jaipur-you-must-try/ and https://www.holidify.com/pages/street-food-in-jaipur-1483.html

Here is my code in case you want to see it:

import scrapy
import re

class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['www.tasteatlas.com','www.lih.travel','www.crazymasalafood.com','www.holidify.com','www.jaipurcityblog.com','www.trip101.com','www.adequatetravel.com']

    start_urls = ['https://www.adequatetravel.com/blog/famous-foods-in-jaipur-you-must-try/',
                  'https://www.holidify.com/pages/street-food-in-jaipur-1483.html'
                  ]

    def parse(self, response):
        # try numbered <h2> headings first, then fall back to <b> tags;
        # the regex keeps only the number and the first word (the problem)
        headings = response.css('h2::text').re(r'\d+\.\s*\w+')
        if not headings:
            headings = response.css('b::text').re(r'\d+\.\s*\w+')
        if headings:
            print(headings)
  • [Don't use RegEx to parse HTML!](https://stackoverflow.com/a/1732454/15578194) [Use an HTML parser.](https://docs.python.org/3/library/html.parser.html) – no ai please Aug 17 '21 at 04:59
  • Is this approach helpful for scraping multiple odd websites at the same time? – Iswar Chand Aug 17 '21 at 05:02
  • You want it to run on multiple threads at the same time? – no ai please Aug 17 '21 at 05:03
  • Sorry for being stupid, but I was searching for an approach so that I can scrape the list from multiple websites like https://www.holidify.com/pages/street-food-in-jaipur-1483.html at the same time – Iswar Chand Aug 17 '21 at 05:09
  • Regex isn't suited for HTML parsing IMO. Why not use BeautifulSoup? – Ram Aug 17 '21 at 07:34
  • Can you show me how? Because I didn't find any other way except regex – Iswar Chand Aug 17 '21 at 07:44
  • If you select everything present inside `h2` and `b` tags from the above URLs, you get lots of unwanted data. Please look at their source code. – Ram Aug 17 '21 at 07:46
  • This is why I am using a regular expression: the targeted data starts with a number, then a dot, then a space, then some text, e.g. 1. Golgappa at Chawla's and Nand's – Iswar Chand Aug 17 '21 at 08:18
  • If you are using Scrapy, why use regex? You can select the data you need from those webpages using [Scrapy Selectors](https://docs.scrapy.org/en/latest/topics/selectors.html#topics-selectors). Regex complicates things. I have added an answer to scrape using `BeautifulSoup`. – Ram Aug 17 '21 at 08:23

2 Answers


This can be done with the newspaper library:

import re
from newspaper import Article

urls = ['https://www.jaipurcityblog.com/9-iconic-famous-dishes-of-jaipur-that-you-have-to-try/',
        'https://www.adequatetravel.com/blog/famous-foods-in-jaipur-you-must-try/',
        'https://www.lih.travel/famous-foods-in-jaipur/',
        'https://www.holidify.com/pages/street-food-in-jaipur-1483.html']

for url in urls:
    site = Article(url)
    site.download()
    site.parse()  # site.text now holds the article's plain text

    # grab every line that starts with a number, a dot and some text
    pattern = re.findall(r'\d+\.\s*[a-zA-Z]+.*', site.text)
    print(pattern)

output:

['1. Dal Baati Churma', '2. Pyaaz Ki Kachori', '3. Gatte ki Sabji', '4. Mawa Kachori', '5. Kalakand', '6. Lassi', '7. Aam ki Launji', '8. Chokhani Kheer', '9. Mirchi Vada']
['1. Keema Baati', '2. Pyaaz Kachori', '3. Dal Baati Churma', '4. Shrikhand', '5. Ghewar', '6. Mawa Kachori', '7. Mirchi Bada', '8. Gatte Ki Subzi', '9. Rajasthani Thali', '10. Laal Maas']
['1. Rajasthani Thali (Plate) at Chokhi Dhani Village Resort', '2. Laal Maans at Handi', '3. Lassi at Lassiwala', '4. Anokhi Café for Penne Pasta & Cheese Cake', '5. Daal Baluchi at Baluchi Restaurant', '6. Pyaz Kachori at Rawat', '7. Chicken Lollipop at Niro’s', '8. Hibiscus Ice Tea at Tapri', '9. Omelet at Sanjay Omelette', '1981. This special egg eatery of Jaipur also treats some never tried before egg specialties. If you are an egg-fan with a sweet tooth, then this is your place. Slurp the “Egg Rabri” of Sanjay Omelette and feel the heavenly juice of eggs in your mouth. Appreciate the good taste of egg in never before way with just a visit to “Sanjay Omelette”.', '10. Paalak Paneer & Missi Roti at Sharma Dhabha']
["1. Golgappa at Chawla's and Nand's", '2. Pyaaz Kachori at Rawat Mishthan Bhandar', '3. Masala Chai at Gulab Ji Chaiwala', '4. Best of Indian Street Food at Masala Chowk', '5. Kaathi Roll at Al Bake', "6. Pav Bhaji at Pandit's", "7. Omelette at Sanjay's", '8. Chicken Tikka at Sethi Bar-Be-Que', '9. Lassi at Lassiwala', '10. Shrikhand at Falahaar', '11. Kulfi Faluda at Bapu Bazaar', '12. Sweets from Laxmi Mishthan Bhandar (LMB)', "13. Fast Food at Aunty's Cafe", '14. Cold Coffee at Gyan Vihar Dairy (GVD)']

Here's another approach using Scrapy, as in the question, which, unlike the answer from Fazlul, doesn't tear apart text in child nodes from text in the parent node:

    def parse(self, response):
        r = re.compile(r'\d+\.')
        # string() returns the full text of each <h2>, child nodes included
        h2s = [e.xpath('string()').extract_first() for e in response.xpath('//h2')]
        nh2s = list(filter(r.match, h2s))   # keep only the numbered headers
        if nh2s:
            print(nh2s)
        …
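
If a page keeps its numbered items in `<b>` tags instead (the fallback case in the question's spider), the same `string()` trick carries over; a sketch along the same lines, reusing the compiled pattern `r` from above:

        # same idea for <b> tags: string() keeps child-node text together
        bs = [e.xpath('string()').extract_first() for e in response.xpath('//b')]
        nbs = list(filter(r.match, bs))   # keep only the numbered entries
        if nbs:
            print(nbs)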
Armali