The answer by @AndrejKesely is definitely a more pythonic way to handle this case, but you mention in the comments that you are still interested as to why your original method had missing values. Justifiably so! This is where learning how to code should start: by trying to understand why code is failing, well before moving on to a refactored solution.
1. The phone numbers
Let's make some prints:
for row in table_.findAll("tr"):
cells = row.findAll('td')
if len(cells)==6:
# ...
# Phone.append(cells[5].find(text=True).replace('T: ',''))
# ...
print(cells[5].findAll(text=True))
['T:\xa0', '204-775-5440', '\xa0\xa0', 'School Contact Information']
['T: 204-586-8497', '\xa0\xa0', 'School Contact Information', '\xa0']
The problem here is inconsistency in the source code. Open up Chrome DevTools with Ctrl + Shift + J
, right click on any of the phone numbers, and select inspect
. You'll move into the "Elements" tab and see how the html
is set up. E.g. first two numbers:
ph_no1 = """
<div>
<span>T: </span>
<span lang="EN">204-775-5440
<span> </span>
</span>
</div>
<div> ... School Contact Information </div>
"""
ph_no2 = """
<div>
<span lang="FR-CA">T: 204-586-8497
<span> </span>
</span>
</div>
<div> ... School Contact Information </div>
"""
The aforementioned prints with findAll
get you the texts from each span
consecutively. I've only shown the first two here, but that's enough to see why you get different data back. So, the problem with the first entry of numbers is that cells[5].find(text=True).replace('T: ','')
is only getting us the first text snippet and in the case of ph_no1
this is 'T:\xa0'
. For the reason why the replace
cannot handle this, see e.g. this SO post
.
As it happens, a couple of phone numbers were problematic:
df['Phone Number'][df['Phone Number']\
.str.extract(r'(\d{3}-\d{3}-\d{4})')[0]\
.to_numpy()!=df['Phone Number'].to_numpy()]
0 T:
32 204-783-9012 # 2 extra spaces
33 204-474-1492 # 2 extra spaces
38 204-452-5015 # 2 extra spaces
Suggested solution for the phone numbers. Instead of your code, try getting all the text and extracting a regex pattern that matches the number using re.search
:
import re
Phone.append(re.search(r'(\d{3}-\d{3}-\d{4})',cells[5].get_text()).group())
# e.g. \d{3}- means 3 digits followed by "-" etc.
2. The postal code
Problem here is basically the same. Here's an irregular postal code (39th entry), followed by a "regular" one;
pc_error = """
<div>
<span>290 Lilac St. </span>
<br>R3M 2T5
</div>
"""
regular_pc = """
<div>
<span>960 Wolseley Ave. </span>
</div>
<div>
<span>R3G 1E7
</span>
</div>
"""
You wrote:
Address.append(cells[4].find(text=True))
PostalCode.append(cells[4].find(text=True).next_element.getText())
But as you can see above, it turns out that the first example does not actually have a next_element
. Now, if you try:
print(len(cells[4].findAll(text=True)))
You'll find that, regardless of the elements, the entire text of each cell will in fact be captured as a list of two strings (['address','postal code']
). E.g.:
['511 Clifton St.\xa0', 'R3G 2X3']
['136 Cecil St.\xa0', 'R3E 2Y9']
So, in this particular case, we could simply write:
Address.append(cells[4].findAll(text=True)[0].strip()) # 1st elem and strip
PostalCode.append(cells[4].findAll(text=True)[1].strip()) # 2nd elem and strip
(or again do .get_text()
and use a regex pattern; as done by @AndrejKesely).
Hope this helps a bit in clearing up the issues, and suggesting some methods of how to spot unexpected behaviour (prints are always a good friend!).