How can I extract the data between a tag

Question

I am learning web scraping. I wrote the following code:

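from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

# Placeholder: the real URL was withheld. Double quotes avoid the
# syntax error caused by the apostrophe in DON'T.
my_url = "DON'T WANT TO SHARE"
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html, "html.parser")
# findAll returns a list-like ResultSet of every <data> element
contents = page_soup.findAll("data")
print(contents)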

Upon printing the contents I am getting something like this:

<data>
------------------------------------
SIM: B01N2W56MD
(P)UBLISHER NAME: Monster 
------------------------------------
(I)[ 0] Publisher: Monster 
(I)[ 1] Title: Monster 
(I)[12] Subject Keyword: nos
------------------------------------
(S)[ 0] Marketplace ID:  1
(S)[ 1] Replenishment Category:  Non Replenishable
(S)[ 5] Title type:  Main title 1
(S)[ 9] Product Group:  No operation Product Handling Group
(S)[19] Product Subcategory:  A
(S)[32] Are batteries required?:  N
------------------------------------
(K)[ 0] IDC: 030347493342
(K)[ 1] ORC: 6800532606463
------------------------------------
</data>

How can I extract these values (for example SIM, Title, IDC, and ORC) and print or store them?

@AntonvBR You should never [parse html using regex](https://stackoverflow.com/a/1732454/893159). And when Beautiful Soup is already used, it should not be necessary anyway. — allo, Nov 15 '17 at 11:54
@Anton vBR Can you please explain in detail? I am new and still learning, so it would be very helpful to me. — Ribhujeet Das, Nov 16 '17 at 04:12
@RibhujeetDas Sorry I was merely trying to point you in the right direction. There are tons of material about BeautifulSoup and Regex that you can learn more about. Good luck! — Anton vBR, Nov 16 '17 at 06:37

Anonta · Answer 1 · 2017-11-15T11:26:25.523

0

You can extract these values using regular expressions:

import re

data = """
<data>
------------------------------------
SIM: B01N2W56MD
(P)UBLISHER NAME: Monster 
------------------------------------
(I)[ 0] Publisher: Monster 
(I)[ 1] Title: Monster 
(I)[12] Subject Keyword: nos
------------------------------------
(S)[ 0] Marketplace ID:  1
(S)[ 1] Replenishment Category:  Non Replenishable
(S)[ 5] Title type:  Main title 1
(S)[ 9] Product Group:  No operation Product Handling Group
(S)[19] Product Subcategory:  A
(S)[32] Are batteries required?:  N
------------------------------------
(K)[ 0] IDC: 030347493342
(K)[ 1] ORC: 6800532606463
------------------------------------
</data>"""

# Capture everything after each label up to the end of that line
sim = re.search(r'SIM:\s(.*?)\n', data).group(1)
idc = re.search(r'IDC:\s(.*?)\n', data).group(1)
title = re.search(r'Title:\s(.*?)\n', data).group(1)

print(sim)
print(idc)
print(title)

The code above looks for the text between "SIM:" and the next "\n" (newline) and saves it in a variable. Exactly the same logic applies to finding the values of "IDC" and "Title".
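If you need every field rather than three hand-picked ones, the same idea can be generalized. The following is only a sketch: the `parse_fields` helper and the `LINE_RE` pattern are names made up for illustration, and it assumes every line of interest has the shape `label: value`, optionally preceded by a prefix such as `(I)[ 1]`.

```python
import re

# Sample text shortened from the <data> block in the question.
data = """\
------------------------------------
SIM: B01N2W56MD
(P)UBLISHER NAME: Monster
------------------------------------
(I)[ 0] Publisher: Monster
(I)[ 1] Title: Monster
------------------------------------
(K)[ 0] IDC: 030347493342
(K)[ 1] ORC: 6800532606463
------------------------------------
"""

# Optional "(X)[ n] " prefix, then a label, a colon, and the value.
# Lines without a colon (the dashed separators) do not match at all.
LINE_RE = re.compile(
    r'^(?:\(\w\)\[\s*\d+\][ \t]*)?([^:\n]+):[ \t]*(.*?)[ \t]*$',
    re.MULTILINE,
)

def parse_fields(text):
    """Hypothetical helper: map each label to its value (last one wins on duplicates)."""
    return {label.strip(): value for label, value in LINE_RE.findall(text)}

fields = parse_fields(data)
print(fields["SIM"])
print(fields["Title"])
print(fields["IDC"])
```

A dictionary keeps only one value per label, so this layout does not suit labels that can repeat.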

edited Nov 15 '17 at 11:26

answered Nov 15 '17 at 11:17


Thanks for the solution, but in my case the whole data is stored in a variable named `contents`, and when I try to access it from there I get an error. Code: `page_soup = soup(page_html, "html.parser")`, `contents = page_soup.findAll("pre")`, `asin = re.search(r'SIM:\s(.*?)\n', contents).group(1)`, `print(asin)`. Error: `return _compile(pattern, flags).search(string)` `TypeError: expected string or bytes-like object` – Ribhujeet Das Nov 16 '17 at 04:16
return _compile(pattern, flags).search(string) TypeError: expected string or bytes-like object – Ribhujeet Das Nov 16 '17 at 04:19
Here the whole data has to be kept in a variable as a string, but in my case I had extracted the data and stored it in another variable, and now I want to extract specific values directly from it. – Ribhujeet Das Nov 16 '17 at 04:26
Your `contents` variable is an element object; you need to convert it into a string. Something like this should do it: `data = contents.text` – Anonta Nov 16 '17 at 04:32
Even after using your command I am getting an error: **AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?** – Ribhujeet Das Nov 16 '17 at 04:40
`page_soup.findAll("data")` returns a list of found elements. You are trying to convert that list of elements into a string; you need to get the element out of the list first: `data = contents[0].text` – Anonta Nov 16 '17 at 04:53
Thanks, that worked for me, but now I am getting a new error: **name=re.search(r'Title:\s(.*?) \n',data).group(1) AttributeError: 'NoneType' object has no attribute 'group'** – Ribhujeet Das Nov 16 '17 at 05:48
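A note on that last error: `re.search` returns `None` when the pattern does not match, and calling `.group(1)` on `None` raises this `AttributeError`. One plausible cause here is the literal space before `\n` in the pattern, which only matches when the value is followed by a space and then a newline. The sketch below checks the match before using it; `find_value` is a hypothetical helper name and the sample text is shortened from the question.

```python
import re

def find_value(label, text):
    """Hypothetical helper: return the value after 'label:' on its line, or None."""
    # Trailing spaces or tabs are optional, so "Monster" and "Monster " both work.
    pattern = re.escape(label) + r':[ \t]*(.*?)[ \t]*$'
    match = re.search(pattern, text, re.MULTILINE)
    return match.group(1) if match else None

data = "SIM: B01N2W56MD\n(I)[ 1] Title: Monster \n"

title = find_value("Title", data)
missing = find_value("Author", data)  # this label is not in the sample text

print(title)
print(missing)
```

Returning `None` lets the caller decide what to do with a missing field instead of crashing partway through a scrape.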
I have one more question: if there are multiple IDC values, how can I extract all of them? – Ribhujeet Das Nov 16 '17 at 07:23
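For the multiple-IDC case, `re.findall` returns every match rather than only the first one. This is a minimal sketch, assuming each IDC sits on its own line in the same `IDC: value` shape; the second IDC value below is invented purely for illustration.

```python
import re

# The second IDC line is a made-up example, not data from the question.
data = """\
(K)[ 0] IDC: 030347493342
(K)[ 1] ORC: 6800532606463
(K)[ 2] IDC: 030347493359
"""

# findall returns a list with the captured group from every match.
idcs = re.findall(r'IDC:[ \t]*(\S+)', data)
print(idcs)
```

If no IDC line is present, `findall` returns an empty list, so no `None` check is needed.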
