
I need to iterate over a large dataset, and store my results in a list. This is the code:

import requests
from bs4 import BeautifulSoup

results = []
n = 10000
for i in range(1, n):
    text = requests.get("https://www.chess.com/games/archive/eono619?gameOwner=other_game&gameTypes%5B0%5D=chess960&gameTypes%5B1%5D=daily&gameType=live&page={}".format(i)).text
    result = BeautifulSoup(text, 'html.parser')

Now, to populate my list, I can do:

   results += result

or:

   results.append(result)

Is either option more efficient for dealing with large datasets? If so, why?

8-Bit Borges
  • Does this answer your question? [Python append() vs. + operator on lists, why do these give different results?](https://stackoverflow.com/questions/2022031/python-append-vs-operator-on-lists-why-do-these-give-different-results) – Iain Shelvington Jan 04 '20 at 22:02
  • 5
    `+=` is like *extend*, not append. – jonrsharpe Jan 04 '20 at 22:03
  • This may help [python list concatenation efficiency](https://stackoverflow.com/questions/12088089/python-list-concatenation-efficiency). Shows timing performance of different methods. – DarrylG Jan 04 '20 at 22:08
  • You might also want to consider using a generator, if there's no strict need to actually accumulate the results. – jarmod Jan 04 '20 at 22:08
  • 1
    @IainShelvington no, it does not. It modifies the list in-place, and is essentially equivalent to `.extend` – juanpa.arrivillaga Jan 04 '20 at 22:10
  • @DataGarden those two do different things, but they are both equally efficient (and as efficient as you will get). – juanpa.arrivillaga Jan 04 '20 at 22:10
  • 3
    @IainShelvington No, `+=` updates the list in-place. Augmented assignment operators were introduced precisely to provide an in-place option, since `results = results + result` necessarily creates a new list first. – chepner Jan 04 '20 at 22:10
  • The two operations do completely different things in this context. Please fix your question to account for that. – Mad Physicist Jan 04 '20 at 22:31
  • Funny. The question opens a discussion of 20 comments, the accepted answer is upvoted twice, but still the question gets 3 downvotes. The downvoter logic sometimes eludes me. – 8-Bit Borges Jan 06 '20 at 21:54

2 Answers


I would use a list comprehension instead.

url = "https://www.chess.com/games/archive/eono619"
params = {
  'gameOwner': 'other_game',
  'gameTypes[0]': 'chess',
  'gameTypes[1]': 'daily',
  'gameType': 'live'
}
results = [BeautifulSoup(requests.get(url, params={**params, 'page': i}).text, 'html.parser') for i in range(1,n)]

though that's pushing it in terms of complexity. (Also, you don't get the chance to check that your request succeeded before trying to parse the response with BeautifulSoup.) Otherwise, use append:

results = []
for i in range(1,n):
    response = requests.get(url, params={**params, 'page': i})
    # TODO Make sure you got a 200 response first
    result = BeautifulSoup(response.text, 'html.parser')
    results.append(result)
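
One minimal way to handle that check, for example, is to call raise_for_status() on the response (which raises requests.HTTPError for 4xx/5xx responses), reusing the url, params, and n defined above, so only pages that actually came back get parsed:

results = []
for i in range(1, n):
    response = requests.get(url, params={**params, 'page': i})
    response.raise_for_status()  # raises requests.HTTPError if the server returned an error status
    results.append(BeautifulSoup(response.text, 'html.parser'))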
chepner

For starters, results += result is not equivalent to results.append(result). It just happens to work because result is iterable, but it will effectively flatten all your objects together. Put another way, += is a somewhat more restricted equivalent of list.extend, not list.append. The latter is much closer to results += [result], which is clearly less efficient.
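
As a quick illustration of that difference, using a plain list of strings to stand in for a parsed page:

item = ['a', 'b']      # stands in for one parsed result

results = []
results.append(item)   # results == [['a', 'b']]  -- one element: the list itself
results = []
results += item        # results == ['a', 'b']    -- the elements were spliced in, as with extend
results = []
results += [item]      # results == [['a', 'b']]  -- wrapping in a list makes += behave like append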

Since you have a fixed number of objects, a list comprehension, as suggested by @chepner's answer, is perfectly fine.

For cases where a variable number of objects is being generated, I'd recommend a collections.deque populated from a generator expression. This is especially important for long sequences. Appending to a list is amortized O(1) per append (the occasional reallocations add up to O(n) work over n appends), but those reallocations aren't free and can hurt in the short term. deques don't reallocate a single contiguous buffer, and they can be turned into a list with one allocation as soon as the length is known.

import requests
from collections import deque
from bs4 import BeautifulSoup

n = 10000
template = 'https://www.chess.com/games/archive/eono619?gameOwner=other_game&gameTypes%5B0%5D=chess960&gameTypes%5B1%5D=daily&gameType=live&page={}'
results = deque(BeautifulSoup(requests.get(template.format(i)).text, 'html.parser') for i in range(1, n))
results = list(results)
Mad Physicist
  • @StefanPochmann. That's probably because the object is iterable. The list will *not* be equivalent to the appended one that way. – Mad Physicist Jan 05 '20 at 00:16
  • @StefanPochmann. That being said, thanks for the catch: I fixed the answer to include. – Mad Physicist Jan 05 '20 at 00:22
  • 1
    Do you have an example of `deque` being faster? I'd find that surprising, and I just did a test, `[x for x in range(1000000)]` was about factor 1.5 faster than `deque(x for x in range(1000000))`. – Stefan Pochmann Jan 05 '20 at 00:31
  • 1
    I also tried `[x for x in X if x]` and `deque(x for x in X if x)` with `X = [random.choice((True, False)) for _ in range(1000000)]` and again the list comprehension was much faster. – Stefan Pochmann Jan 05 '20 at 00:58
  • @StefanPochmann. I don't think deque will ever be faster for an iterable that supports `len` (or actually a proper `__length_hint__`), regardless of the condition, since a list comprehension can always preallocate the upper bound. Make `X` into a generator and see if there's a difference (replace `[]` with `()`, and don't forget to reinitialize between runs). – Mad Physicist Jan 05 '20 at 01:17
  • @MadPhysicist list comprehensions don't preallocate ever. They always just use `.append` under the hood. In any case, appending to the end of a list will be faster than a deque, but slower at the beginning. – juanpa.arrivillaga Jan 05 '20 at 01:22
  • @MadPhysicist Ok I tried `[x for x in g() if x]` and `deque(x for x in g() if x)` with `def g(): for x in X: yield x`, and list comprehension was still much faster. – Stefan Pochmann Jan 05 '20 at 01:29
  • @juanpa.arrivillaga. I'm not sure I understand why. deques don't need to copy data/reallocate. Is it because the default block size is small enough that the frequent allocations dominate once the list size goes up sufficiently? – Mad Physicist Jan 05 '20 at 01:31
  • @Stefan. Thanks for running the tests and educating me. I'm not near a desktop atm, so doubly appreciate it. I'll delete this answer as soon as I understand why it's so wrong. – Mad Physicist Jan 05 '20 at 01:33
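
As a minimal sketch of the kind of timing comparison described in these comments (sizes are illustrative, and results will vary by machine and interpreter), the contrast can be reproduced roughly like this:

from collections import deque
from timeit import timeit

def gen(n=1_000_000):
    # a generator with no usable length hint, so neither container knows the final size up front
    for x in range(n):
        yield x

# build the same sequence via a list comprehension and via a deque fed by a generator expression
print(timeit('[x for x in gen()]', globals=globals(), number=10))
print(timeit('deque(x for x in gen())', globals=globals(), number=10))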