
I am fairly new to Python myself. For training purposes, I am trying to scrape some data from a website. Digging through the HTML/CSS of said website taught me that it isn't that simple, because most divs etc. don't have a class or ID.

<table class="trade-list-table max-width">
<thead>
</thead>

<tbody>

<tr class="cursor-pointer" data-on-click-link="/pc/Trade/Detail/313809613" data-on-click-link-action="NewWindow" data-toggle="tooltip" data-original-title="" title="">
<td>
<img class="trade-item-icon item-quality-legendary" alt="Icon" src="./Search Result - Tamriel Trade Centre_files/crafting_outfitter_potion_014.png" data-original-title="" title="">
<div class="item-quality-legendary">
XXSTRING1XX
</div>
<div>
Level:
<img class="small-icon" src="./Search Result - Tamriel Trade Centre_files/nonvet.png">
XXSTRING2XX
</div>
</td>

<td class="hidden-xs">
<div class="text-small-width                     text-danger">
XXSTRING3XX
</div>
</td>

<td class="hidden-xs">
<div>
XXSTRING4XX
</div>
<div>
XXSTRING5XX
</div>
</td>

<td class="gold-amount bold">
<img class="small-icon" src="./Search Result - Tamriel Trade Centre_files/gold.png">
XXSTRING6XX
<div class="text-danger">
X
</div>
<img class="small-icon" src="./Search Result - Tamriel Trade Centre_files/amount.png">
XXSTRING7XX
<div class="text-danger">
=
</div>
<img class="small-icon" src="./Search Result - Tamriel Trade Centre_files/gold.png">
54,999
</td>

<td class="bold hidden-xs" data-mins-elapsed="2">Now</td>
</tr>

I have tried many things and have been struggling with this for the past 7 days. When I print the result, I need XXSTRING1XX through XXSTRING7XX so that I can push them into a .csv file or something similar.

The difficulty I've been having is that most divs don't have a specific class, and in most cases I am unable to return the string I need.

I've been using Python with Requests and BeautifulSoup from bs4.

import requests
from bs4 import BeautifulSoup

page = requests.get('https://eu.tamrieltradecentre.com/pc/Trade/SearchResult?ItemID=211&SearchType=Sell&ItemNamePattern=Dreugh+Wax&ItemCategory1ID=&ItemCategory2ID=&ItemCategory3ID=&ItemTraitID=&ItemQualityID=&IsChampionPoint=false&LevelMin=&LevelMax=&MasterWritVoucherMin=&MasterWritVoucherMax=&AmountMin=&AmountMax=&PriceMin=&PriceMax=')
soup = BeautifulSoup(page.content, 'html.parser')

container = soup.find(class_="trade-list-table max-width")

itembox = container.find_all(class_="cursor-pointer")

item = itembox[0]

# All cells (td) of the first row
tr1 = item.find_all('td')

# Itemname
itemname = item.find('div', class_="item-quality-legendary").get_text()
print(itemname)

# Itemlevel + level type
# Tradername
# Location
# Guild name
# Unit price
# Quantity
# Total price
# Timestamp?
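
For reference, here is a rough sketch of one way the seven strings could be pulled out of each result row and written to a .csv. It assumes every row follows the structure shown in the HTML above; the mapping of XXSTRING1XX through XXSTRING7XX to fields is only guessed from the comment list in the code, and the CSV header names are made up:

import csv
import requests
from bs4 import BeautifulSoup

# Same search URL as above, split over several lines for readability
url = ('https://eu.tamrieltradecentre.com/pc/Trade/SearchResult'
       '?ItemID=211&SearchType=Sell&ItemNamePattern=Dreugh+Wax'
       '&ItemCategory1ID=&ItemCategory2ID=&ItemCategory3ID='
       '&ItemTraitID=&ItemQualityID=&IsChampionPoint=false'
       '&LevelMin=&LevelMax=&MasterWritVoucherMin=&MasterWritVoucherMax='
       '&AmountMin=&AmountMax=&PriceMin=&PriceMax=')

page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
container = soup.find(class_="trade-list-table max-width")

with open('listings.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['item', 'level', 'trader', 'location', 'guild',
                     'unit_price', 'amount', 'total_price', 'listed'])

    # One <tr class="cursor-pointer"> per listing
    for row in container.find_all(class_="cursor-pointer"):
        cells = row.find_all('td')
        head = list(cells[0].stripped_strings)   # ['<item name>', 'Level:', '<level>']
        place = list(cells[2].stripped_strings)  # ['<location>', '<guild>']
        price = list(cells[3].stripped_strings)  # ['<unit price>', 'X', '<amount>', '=', '<total>']
        writer.writerow([
            head[0],                        # XXSTRING1XX, item name
            head[-1],                       # XXSTRING2XX, level
            cells[1].get_text(strip=True),  # XXSTRING3XX, trader(?)
            place[0],                       # XXSTRING4XX, location(?)
            place[1],                       # XXSTRING5XX, guild(?)
            price[0],                       # XXSTRING6XX, unit price
            price[2],                       # XXSTRING7XX, amount
            price[-1],                      # total price, e.g. 54,999
            cells[4].get_text(strip=True),  # e.g. 'Now'
        ])
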
  • Any code that you might have written or any failed code? Any reproducible code with inputs, outputs and import files may be more helpful for us to debug – LazyCoder Jul 26 '19 at 20:51
  • @Bamieschijf What do you actually want? A solution to how to extract data from HTML with divs without classes and stuff, or some custom logic with your existing code? – Nothing Jul 26 '19 at 20:55
  • Thank you both for your quick answers. I edited my current code into the question above. I managed to select the respective tr and td which have the data I require. However, I can't seem to find a way to print out the respective strings (XXSTRING1XX through 7). I did manage to print out the item name (string 1) – Bamieschijf Jul 26 '19 at 21:31

1 Answer


EDIT: Since you're looking for specific strings from some data source, let's say a text file containing the strings to search for, then:

file.txt

some
unknown
strings
to
look
for
...

bs.py

import re
import requests
from bs4 import BeautifulSoup

filename = 'file.txt'  # file containing the unknown strings
with open(filename, 'r') as f:  # open file
    data = f.readlines()
data = [line.strip('\n') for line in data]  # ['some','unknown','strings','to','look','for',...]

src = requests.get(...)  # the search URL from the question goes here
soup = BeautifulSoup(src.content, 'html.parser')
results = []

for target in data:
    result = soup.find_all(string=re.compile(target))  # look at documentation for other functionalities!
    if result:  # if any results are found
        for string in result:
            string = string.split()  # split into whitespace-separated parts
            results.append(string)
    else:  # no results found
        results.append(result)
print(results)  # do something

This should give you a general idea of what to do. If you're still unsure, look at BS4's documentation.
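
To make the matching concrete, here is a tiny, self-contained illustration of what soup.find_all(string=re.compile(...)) returns; the HTML snippet and the two search terms are made up for the example:

import re
from bs4 import BeautifulSoup

html = '<td><div>Dreugh Wax</div><div>Level: 1</div></td>'
soup = BeautifulSoup(html, 'html.parser')

for target in ['Dreugh', 'Level']:
    matches = soup.find_all(string=re.compile(target))  # all text nodes containing the pattern
    print(target, '->', [m.strip() for m in matches])
# Dreugh -> ['Dreugh Wax']
# Level -> ['Level: 1']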

ugatah
  • Hello Ugatah. First of all, I would like to thank you for your input. I tried solving my issue with regex, but I failed to do so. It was probably a dumb idea, but I changed the string names in my question above. These strings constantly change (they are variable), so I need to find a way to select the strings without knowing what they are. – Bamieschijf Jul 27 '19 at 07:55
  • I am quite familiar with PowerShell; using PowerShell and regex I did manage to select the data I need. However, I have no idea how to use regex in Python the same way I did in PowerShell. Maybe you have a better solution? I pasted my PowerShell code on Codepile: [link](https://www.codepile.net/pile/b7b3BN61) – Bamieschijf Jul 27 '19 at 07:57
  • For unknown strings, make them into a list in Python, loop through the list, and plug each unknown string into a regex or something similar in BeautifulSoup (a rough sketch of that loop is shown below, after these comments). Also, regex differs slightly from language to language; I've never used it in PS. You can skim through how Python 3 uses regex: https://docs.python.org/3/howto/regex.html. There are websites that can test Python regex online. I would also check this post out, it might help: https://stackoverflow.com/questions/8936030/using-beautifulsoup-to-search-html-for-string. – ugatah Jul 27 '19 at 17:41
  • Thanks for your input again. I tried many things, but I don't really understand what you mean. I am pretty sure that I need to make a loop for every regex expression (I used 4 in PowerShell). I just don't have a single clue how I'd make it in Python :/ – Bamieschijf Jul 28 '19 at 11:37
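
For reference, a rough sketch of how a loop over several regex patterns can look in Python; the patterns below are placeholders, since the actual PowerShell expressions from the linked paste are not reproduced here:

import re

# Placeholder patterns; the real ones would come from the PowerShell script
patterns = {
    'level': re.compile(r'Level:\s*(\d+)'),
    'total': re.compile(r'([\d,]+)\s*$'),
}

text = 'Level: 1 ... 54,999'  # e.g. the visible text of one table row
for name, pattern in patterns.items():
    match = pattern.search(text)
    print(name, '->', match.group(1) if match else None)
# level -> 1
# total -> 54,999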