I am learning web scraping. I wrote the following code:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

# Placeholder only; the real URL is withheld.
my_url = "DON'T WANT TO SHARE"
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html, "html.parser")
# findAll returns a list-like ResultSet of every <data> element.
contents = page_soup.findAll("data")
print(contents)

Upon printing the contents I am getting something like this:

<data>
------------------------------------
SIM: B01N2W56MD
(P)UBLISHER NAME: Monster 
------------------------------------
(I)[ 0] Publisher: Monster 
(I)[ 1] Title: Monster 
(I)[12] Subject Keyword: nos
------------------------------------
(S)[ 0] Marketplace ID:  1
(S)[ 1] Replenishment Category:  Non Replenishable
(S)[ 5] Title type:  Main title 1
(S)[ 9] Product Group:  No operation Product Handling Group
(S)[19] Product Subcategory:  A
(S)[32] Are batteries required?:  N
------------------------------------
(K)[ 0] IDC: 030347493342
(K)[ 1] ORC: 6800532606463
------------------------------------
</data>

How can I extract these values (for example, the value of SIM, Title, IDC, or ORC) and print or store them?

rlandster
Ribhujeet Das
  • Try the google search: extract text with regex python – Anton vBR Nov 15 '17 at 11:16
  • @AntonvBR you should never [parse html using regex](https://stackoverflow.com/a/1732454/893159). And when Beautiful Soup is already used, it should not be necessary anyway. – allo Nov 15 '17 at 11:54
  • @Anton vBR Can you please explain in detail? I am new and still learning, so it would be very helpful to me. – Ribhujeet Das Nov 16 '17 at 04:12
  • @RibhujeetDas Sorry I was merely trying to point you in the right direction. There are tons of material about BeautifulSoup and Regex that you can learn more about. Good luck! – Anton vBR Nov 16 '17 at 06:37

1 Answer

You can extract these values using regular expressions:

import re

data = """
<data>
------------------------------------
SIM: B01N2W56MD
(P)UBLISHER NAME: Monster 
------------------------------------
(I)[ 0] Publisher: Monster 
(I)[ 1] Title: Monster 
(I)[12] Subject Keyword: nos
------------------------------------
(S)[ 0] Marketplace ID:  1
(S)[ 1] Replenishment Category:  Non Replenishable
(S)[ 5] Title type:  Main title 1
(S)[ 9] Product Group:  No operation Product Handling Group
(S)[19] Product Subcategory:  A
(S)[32] Are batteries required?:  N
------------------------------------
(K)[ 0] IDC: 030347493342
(K)[ 1] ORC: 6800532606463
------------------------------------
</data>"""

# Capture everything after each label, up to the end of that line.
sim = re.search(r'SIM:\s(.*?)\n', data).group(1)
idc = re.search(r'IDC:\s(.*?)\n', data).group(1)
title = re.search(r'Title:\s(.*?)\n', data).group(1)

print(sim)
print(idc)
print(title)

The code above looks for the text between "SIM:" and the next "\n" (newline) and saves it in a variable. Exactly the same logic applies to finding the values of "IDC" and "Title".
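If you want every field rather than three hand-picked ones, the same idea can be applied line by line. This is only a sketch: the `parse_fields` helper and the `LINE_RE` pattern are my own names, and the pattern assumes every line of interest looks like the sample in the question (an optional `(X)[ n]` prefix, a label, a colon, then the value).

```python
import re

# Sample text copied from the question's output (shortened).
data = """
<data>
------------------------------------
SIM: B01N2W56MD
(P)UBLISHER NAME: Monster
------------------------------------
(I)[ 0] Publisher: Monster
(I)[ 1] Title: Monster
(I)[12] Subject Keyword: nos
------------------------------------
(S)[ 0] Marketplace ID:  1
(S)[32] Are batteries required?:  N
------------------------------------
(K)[ 0] IDC: 030347493342
(K)[ 1] ORC: 6800532606463
------------------------------------
</data>"""

# Optional "(X)[ n] " prefix, then a label, a colon, and the value.
# Assumption: labels never contain a colon.
LINE_RE = re.compile(
    r'^(?:\([A-Z]\)\[\s*\d+\]\s*)?(?P<label>[^:\n]+):\s*(?P<value>.*?)\s*$'
)


def parse_fields(text):
    """Return {label: value} for every 'label: value' line in text."""
    fields = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            fields[match.group('label').strip()] = match.group('value')
    return fields


fields = parse_fields(data)
print(fields['SIM'], fields['Title'], fields['IDC'], fields['ORC'])
```

A dictionary keeps only the last value when a label repeats, so this form suits data where each label appears once.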

Anonta
  • Thanks for the solution, but in my case the whole data is stored in a variable named `contents`, and when I try to access it from there I get an error. My code: `page_soup = soup(page_html, "html.parser")` `contents = page_soup.findAll("pre")` `asin = re.search(r'SIM:\s(.*?)\n', contents).group(1)` `print(asin)` The error: `return _compile(pattern, flags).search(string) TypeError: expected string or bytes-like object` – Ribhujeet Das Nov 16 '17 at 04:16
  • return _compile(pattern, flags).search(string) TypeError: expected string or bytes-like object – Ribhujeet Das Nov 16 '17 at 04:19
  • Here the whole data has to be kept in a variable as a string, but in my case I have already extracted the data and stored it in another variable, and now I want to extract specific values from it directly. – Ribhujeet Das Nov 16 '17 at 04:26
  • Your `content` variable is an element object, you need to convert it into a string. Something like this should do it: `data = content.text` – Anonta Nov 16 '17 at 04:32
  • Even after using your command I am getting an error: **AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?** – Ribhujeet Das Nov 16 '17 at 04:40
  • `page_soup.findAll("data")` returns a list of the elements it found. You are trying to convert that whole list into a string; you need to get the element out of the list first: `data = contents[0].text` – Anonta Nov 16 '17 at 04:53
  • Thanks, that worked for me, but now I am getting a new error: **name=re.search(r'Title:\s(.*?) \n',data).group(1) AttributeError: 'NoneType' object has no attribute 'group'** – Ribhujeet Das Nov 16 '17 at 05:48
  • I have one more question: what if there are multiple IDC values? How can I extract all of them? – Ribhujeet Das Nov 16 '17 at 07:23
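
On the last two comments: `re.search` returns `None` when the pattern does not match, and calling `.group(1)` on `None` raises that `AttributeError`. A likely cause here (I cannot confirm it without the real page) is the literal space before `\n` in the pattern, which only matches when the line actually ends with a space. For several IDC lines, `re.findall` returns every match instead of only the first. Below is a minimal sketch; `first_value`, `all_values`, and the sample `text` are hypothetical names and data, and `text` stands in for the string you get from `contents[0].text`.

```python
import re


def first_value(label, text):
    """Return the first value for label, or None when the label is absent."""
    match = re.search(r'\b' + re.escape(label) + r':[ \t]*(.*)', text)
    return match.group(1).strip() if match else None


def all_values(label, text):
    """Return every value for label, in order of appearance."""
    pattern = r'\b' + re.escape(label) + r':[ \t]*(.*)'
    return [value.strip() for value in re.findall(pattern, text)]


# Hypothetical text: two IDC lines, and no newline after the last one.
text = (
    "(I)[ 1] Title: Monster \n"
    "(K)[ 0] IDC: 030347493342\n"
    "(K)[ 1] IDC: 030347493359"
)

print(first_value('Title', text))
print(all_values('IDC', text))
print(first_value('Missing', text))
```

Because `.` does not match a newline, `(.*)` stops at the end of the line on its own, so the pattern does not depend on a trailing `\n` or on trailing spaces.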