1

I'm scraping a website to store data in a database that has 3 columns. The part of the website I'm scraping looks like one of the three examples below:

# Example 1:
<div>
<a href="sample1">text1</a>
</div>

# Example 2:
<div>
<a href="sample1">text1</a>
<a href="sample2">text2</a>
</div>

# Example 3:
<div>
<a href="sample1">text1</a>
<a href="sample2">text2</a>
<a href="sample3">text3</a>
</div>

I'm trying to assign

  • "text1" to var1,
  • either an empty string or "text2" to var2,
  • either an empty string or "text3" to var3.

What is the best method to do this?

A few things I've tried are

### FIRST ATTEMPT
var1, var2, var3 = '','',''
# could also do var1, var2, var3 = ('',)*3
all = soup.find_all('a')

var1 = all[0].text

try:
    var2 = all[1].text
except IndexError:
    pass

try:
    var3 = all[2].text
except IndexError:
    pass

#### SECOND ATTEMPT
all = [s.text for s in soup.find_all('a')]
# This is where I get stuck... This could return a list of length 1, 2, or 3,
# and I need the output to be a list of length 3 so I can use the following
# line to assign variables
var1, var2, var3 = all

#### THIRD ATTEMPT
all = [s.text for s in soup.find_all('a')]
var1, var2, var3 = '','',''
n = len(all)
var1 = all[0]  # all already holds strings, so no .text here
if n >= 2:
    var2 = all[1]
if n >= 3:
    var3 = all[2]

EDIT: The reason I'm trying to have three fields in my db is that I want to be able to filter by each of these variables. var1 is the most specific label, var2 is slightly less specific, and var3 is accurate only at a high level. Think of it like clothing: var1 could be grey-slacks, var2 could be business-slacks, and var3 could be pants.

exhoosier10
  • Might I ask *why* you're trying to do it this way? Instead, just assign the results of `find_all` to a list, and then you can use the objects *within* the list. – David Zemens Nov 14 '15 at 01:54
  • I'm trying to build out a database. The database will ultimately have three columns, as this section of code can have up to three values and I would like to capture them all. At some point, I'll have to set a field to be blank, so I'm trying to figure out a neat way to do so rather than the brute-force if...elif... method. I might come across a time where there are > 3 fields I'd like to capture, and I'd like to have a smoother way to do so in that case, too. – exhoosier10 Nov 14 '15 at 01:56
  • OK, you can capture them all and still use other logic to control how/when they are written to your db. I'll write an answer below to help out if I can. – David Zemens Nov 14 '15 at 01:57

4 Answers

2

You can use some simple list multiplication:

# use a constant at the top of your script in case the number of columns
# changes in the future
COLUMNS = 3

# ... other code ...

all = [s.text for s in soup.find_all('a')]
all.extend(['']*(COLUMNS-len(all))) # append 1 empty string for each missing text field
var1, var2, var3 = all 

But as David Zemens has mentioned in the comments, there has got to be a better way to do this. I can't make any concrete suggestions without seeing the code that consumes your text variables, but you should seriously reconsider your design. Even if you use the constant like I suggested, having var1, var2, var3 = all is still going to make it difficult to maintain and modify this script in the future.
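
One caveat if you do keep the padding approach: extend() only covers the case of too few links. If the page ever has more than COLUMNS links, the unpacking line raises ValueError. A minimal sketch of a helper that both pads and truncates is below; pad_to is a hypothetical name, and the texts list stands in for [s.text for s in soup.find_all('a')].

```python
from itertools import chain, islice, repeat

COLUMNS = 3

def pad_to(values, size, fill=''):
    """Return exactly `size` items: drop any extras, pad any shortfall with `fill`."""
    # chain() appends an endless supply of `fill`; islice() stops after `size` items
    return list(islice(chain(values, repeat(fill)), size))

# stand-in for: [s.text for s in soup.find_all('a')]
texts = ['text1', 'text2']

var1, var2, var3 = pad_to(texts, COLUMNS)
print(var1, var2, repr(var3))
```

Because the helper always returns exactly COLUMNS items, the unpacking line cannot fail on length, whatever the page contains.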


Based on your edit, I would suggest you use a dictionary instead. This will allow you to reference specific data by name, like you would reference a variable, but retains the flexibility of a list instead of restricting you to the number of variables you have hard coded.

For example:

all = [s.text for s in soup.find_all('a')]

d = {}
for i, field in enumerate(all): 
    d['var{}'.format(i)] = field

# later in your code that consumes this dictionary...

try:
    foo(d['var1']) # function to do something with the scraped string corresponding
                   # to var1
except KeyError:
    # do something else or pass when the expected data doesn't exist
    pass

If all is ['a', 'b'], then this code produces this:

{'var1': 'b', 'var0': 'a'}

Variable assignments are really nothing more than a mapping - your code knows the variable name, and it can look up the corresponding value. A dictionary lets your code build the mapping on the fly instead of you having to hard code it. Now we have built a dictionary where the varX variables are constructed dynamically. If you decide to add another column, you don't have to change this code at all. You just add your code that would use var4 and be ready to catch the exception if var4 doesn't exist in the dictionary. No more adding empty strings - your code is ready to handle the case where the data it's looking for doesn't exist.
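
If catching KeyError feels heavy when all you want is a default value, dict.get() does the lookup and the fallback in one call. This is only a sketch of that alternative, not a replacement for the try/except above; the texts list is an assumed stand-in for the scraped strings.

```python
# stand-in for: [s.text for s in soup.find_all('a')]
texts = ['a', 'b']

# same mapping as the loop above, written as a dict comprehension
d = {'var{}'.format(i): field for i, field in enumerate(texts)}

# .get() returns the second argument when the key is absent
second = d.get('var1', '')
missing = d.get('var2', '')

print(repr(second), repr(missing))
```

Use the try/except form when a missing key should trigger different behaviour, and .get() when an empty string is an acceptable placeholder.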

Notes:

  1. The enumerate() function iterates over an iterable object and increments a counter for you. In my code, i is the counter (so we can construct the 'var0', 'var1'... strings; pass start=1 to enumerate() if you want the numbering to begin at 'var1'), and field is each item from the list.
skrrgwasme
  • This looks pretty good and solves the immediate problem. For maintainability, though, hardcoding the *number of elements* may not be ideal; the number of columns (and consequently the number of "variables") might need to change in the future. – David Zemens Nov 14 '15 at 02:03
  • @DavidZemens I strongly agree. I also agree with your earlier comments that there must be a better way to do this. I've adjusted my answer to suggest at least using a constant that will be easier to change if the OP decides to stick with this approach. – skrrgwasme Nov 14 '15 at 02:05
  • extend is perfect in this situation. As I'm thinking about db functionality, in order to quickly filter the db, I might make 4 columns.... var_all, var1, var2, and var3. Since var1 is the most appropriate label while the other labels are secondary (think of clothing... var1 could be grey-slacks, var2 could be business-slacks, and var3 could be pants), being able to quickly filter all labels could be useful, too. – exhoosier10 Nov 14 '15 at 02:08
  • @exhoosier10 What David and I are getting at though, is there is probably a better way to do this than splitting out the list into explicit variables. These variables are being consumed by other code somewhere, right? If you can make that code operate on the list rather than individual variables, your script will be *much* easier to modify later. You're intentionally subtracting from the flexibility that lists are intended to provide. – skrrgwasme Nov 14 '15 at 02:11
  • @skrrgwasme I understand what you're saying. Assuming a field of "grey-slacks, business-slacks, pants", I should be able to filter the db by any of these values, or with something like where var_all = "grey-slacks%"... So I'm starting to think splitting into variables might be a waste of resources – exhoosier10 Nov 14 '15 at 02:16
  • @exhoosier10 I'm making an edit now that I think you'll be interested in... Stand by a moment. – skrrgwasme Nov 14 '15 at 02:17
  • @exhoosier10 see my answer for another alternative. – David Zemens Nov 14 '15 at 02:25
  • @exhoosier10 I've finished my edit if you want to take a look, but David's looks good as well. – skrrgwasme Nov 14 '15 at 02:30
  • @skrrgwasme -- I've always thought that if I'm forcing myself to use try...except, I'm doing something wrong, which is why I originally posted my question. That being said, I've never known a method in Python to create dynamic variables (as, again, I've always been told it's bad practice, so I never looked too much into it), so thank you for this. – exhoosier10 Nov 14 '15 at 02:36
  • @DavidZemens I'll take a look at your code and compare with this and ultimately mark which one best solved my problem. At this point, both answers were very useful and I appreciate both your help and skrrgwasme 's. – exhoosier10 Nov 14 '15 at 02:40
  • @exhoosier10 I think you're misunderstanding a couple of pieces of advice that do have some truth to them. You shouldn't use exceptions for normal flow control - if you're using exceptions to exit functions instead of proper function returns, for example, that is wrong. My usage of exceptions could arguably be replaced with an `if` statement to check if an item exists in the dictionary before trying to access it instead, but I prefer to use the [EAFP](http://stackoverflow.com/q/11360858/2615940) approach, as is common among Python programmers. – skrrgwasme Nov 14 '15 at 02:44
  • Great minds think alike, good use of dictionary object for mapping! – David Zemens Nov 14 '15 at 02:44
  • @exhoosier10 You can check [here](http://stackoverflow.com/q/16138232/2615940) for another SO question on using exceptions. As for the dictionary, perhaps I shouldn't have suggested so strongly that it is like dynamically creating variables. It's not self-modifying code; dictionaries are common data structures used in many languages (sometimes with different names) that allow you to associate objects with each other. In my example, I am simply mapping strings (`var1`, etc) to the strings you scraped from the website. – skrrgwasme Nov 14 '15 at 02:48
  • @exhoosier10 Building on your pants example, you could (and should) replace `'var1'` with `'grey-slacks'`, `'business-slacks'`, etc. Then your later code that does something with the string that corresponds to `'grey-slacks'` can index into the dictionary with that string to retrieve the appropriate scraped string. No grey slacks string scraped from the website? That's fine, just catch the exception. I assume `"sample1"`, etc. are strings you can scrape as well. Use those strings as dictionary keys and your column headers, and you can refer to everything by descriptive names. – skrrgwasme Nov 14 '15 at 02:53
  • @skrrgwasme So there are about 100 different categories of clothing on this website. Ultimately, per article of clothing, there could be 1, 2, or 3 descriptions in the HTML. How do you recommend setting up a db to best account for these descriptive variables? Ultimately, all of this feedback in my python structure is just going to get inserted into a db. – exhoosier10 Nov 14 '15 at 03:14
2

Your second attempt is probably more pythonic. Of course, you don't know in advance whether the result of .find_all will be a list of length 3 (or more, or less). So you should use try/except or other logic to control how/when the results are written to your database.

# create a dictionary of your database column names:
dbColumns = {0:'column1', 1:'column2', 2:'column3'}

# get all the results; there might be 0 or 3 or any number really, 
#     we'll deal with that later
results = [s.text if s.text else "" for s in soup.find_all('a')]

# iterate the items in the list, and put each in its corresponding DB column
for col in range(len(results)):
    # use the dbColumns dict to insert to the desired column
    # (string concatenation is for illustration only; prefer your driver's
    #  parameterized queries for real data)
    query = "INSERT INTO [db_name].[" + dbColumns[col] + "] "
    query += "VALUES ('" + results[col] + "')"

    # db.insert(query)  # assumes a db object that has an "insert" function; modify as needed

The point of this approach is that there seems to be nothing about this problem that technically would require hardcoding exactly three objects (var1, var2, var3) and trying to assign to these. Instead, just return the results of find_all and deal with them by their index within that resulting list.
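
For a concrete version of the write step, here is a rough sketch using Python's built-in sqlite3 module with a parameterized query. It is a different take from the per-column loop above: it writes one row per scraped div and pads missing labels with empty strings. The table name items, the column names, and the insert_labels helper are all assumptions made for illustration; adapt them to your own schema and database driver.

```python
import sqlite3

# hypothetical column names; change these to match your schema
COLUMNS = ('label1', 'label2', 'label3')

def insert_labels(conn, labels):
    """Insert one row, padding missing labels with '' and ignoring extras."""
    row = list(labels[:len(COLUMNS)])
    row += [''] * (len(COLUMNS) - len(row))
    placeholders = ', '.join('?' for _ in COLUMNS)
    sql = 'INSERT INTO items ({}) VALUES ({})'.format(
        ', '.join(COLUMNS), placeholders)
    # values are passed separately, so the driver handles quoting
    conn.execute(sql, row)

conn = sqlite3.connect(':memory:')  # in-memory database, for demonstration only
conn.execute('CREATE TABLE items (label1 TEXT, label2 TEXT, label3 TEXT)')

# stand-in for: [s.text for s in soup.find_all('a')]
insert_labels(conn, ['grey-slacks', 'business-slacks'])
conn.commit()

rows = conn.execute('SELECT label1, label2, label3 FROM items').fetchall()
print(rows)
```

Adding a fourth label column then means changing COLUMNS and the table definition, while the insert code stays the same.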

David Zemens
1

How about:

all = soup.find_all('a')
var1 = all[0].text if len(all) > 0 else ""
var2 = all[1].text if len(all) > 1 else ""
var3 = all[2].text if len(all) > 2 else ""

The conditional expression x if y else z (often called a ternary operator) keeps the code simple and readable. It's not going to win any design awards though.

Galax
0

You can try this:

# if list1 has an uncertain number of values and you want to give each one its own variable

# create list2 with the maximum number of possible variable names
list2 = ['var1', 'var2', 'var3', 'var4']  # ... extend as needed

for li1, li2 in zip(list1, list2):
    globals()[li2] = li1
    print(li2)

I'm not a Python pro; I just figured this out on my own. It might not be very Pythonic, but it solves the problem.

Assad Ali