Splitting a list with regex

Question

I am having trouble splitting each element within a nested list. I used this method for my first split, and now I want to split the resulting nested list again. I thought I could reuse the same line of code with a few modifications, `goal2 = [[j.split("") for j in goal]]`, but I keep getting the error `'list' object has no attribute 'split'`. I know that you cannot split a list, but I do not understand how my modification differs from the linked method. This is my first web scraping project, and I am looking for just the phone numbers on the website. I'd like help fixing my issue rather than entirely new code, so that I can continue to learn and improve my own methods.

import requests
import re
from bs4 import BeautifulSoup


source = requests.get('https://www.pickyourownchristmastree.org/ORxmasnw.php').text
soup = BeautifulSoup(source, 'lxml')

info = soup.findAll(text=re.compile("((?:\d{3}|\(\d{3}\))?(?:\s|-|\.)?\d{3}(?:\s|-|\.)\d{4})"))[:1]
goal = [i.split(".") for i in info]
goal2 = [[j.split("") for j in goal]]

for x in goal:
    del x[2:]

for y in goal:
    del y[:1]



print('info:', info)
print('goal:', goal)

Output without goal2 variable:

info: ['89426 Green Mountain Road, Astoria, OR 97103. Phone: 503-325-9720. Open: ']
goal: [[' Phone: 503-325-9720']]

Desired Output with "goal2" variable:

info: ['89426 Green Mountain Road, Astoria, OR 97103. Phone: 503-325-9720. Open: ']
goal: [[' Phone: 503-325-9720']]
goal2: ['503-325-9720']

I will obviously have more numbers, but I didn't want to clog up the space. So it would look something more like this:

goal2: ['503-325-9720', '###-###-####', '###-###-####', '###-###-####']

But I want to make sure that each number can be exported into a new row within a csv file. So when I create a csv file with a header "Phone", each number above will be in a separate row and not clustered together. I am thinking that I might need to change my code to a for loop?

I ran your code and I had no problems... – johnashu Apr 23 '20 at 19:18

r.ook · Accepted Answer · 2020-04-23T20:16:55.117


The cleaner approach here would be to just do another regex search on your info, e.g.:

pat = re.compile(r'\d{3}\-\d{3}\-\d{4}')
goal = [pat.search(i).group() for i in info if pat.search(i)]

Outputs:

goal: ['503-325-9720']
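
As a side note on the original error: after the first comprehension, `goal` is a list of lists, so each `j` in `for j in goal` is itself a list rather than a string, which is why `.split` is not available on it. Separately, `str.split("")` with an empty separator raises `ValueError`. Below is a minimal sketch of a nested split, assuming the sample string from the question and a split on `": "` (the separator is my assumption, not something stated in the question):

```python
# Sample string copied from the question's output
info = ['89426 Green Mountain Road, Astoria, OR 97103. Phone: 503-325-9720. Open: ']

# First split: each string becomes a list of pieces
goal = [i.split(".") for i in info]

# Keep only the second piece of each inner list (same result as the two del loops)
goal = [pieces[1:2] for pieces in goal]

# Second split: loop over the outer list AND the inner list,
# so that .split is called on a string, never on a list.
# The ": " separator is an assumption made for this sketch.
goal2 = [piece.split(": ")[-1] for pieces in goal for piece in pieces]

print('goal:', goal)
print('goal2:', goal2)
```

This works for the sample line, but it depends on every line having the same layout, which is why the regex search is likely the more robust choice.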

Or, if there is more than one number per line:

# use a capturing group instead
pat = re.compile(r'(\d{3}\-\d{3}\-\d{4})')
goal = [pat.findall(i) for i in info]

Outputs:

goal: [['503-325-9720', '123-456-7890']]
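
On the CSV part of the question: `findall` returns one list per line, so the result needs flattening before each number can go on its own row. Here is a rough sketch using the standard `csv` module. The sample strings, the `write_phones` helper and the `phones.csv` filename mentioned afterwards are all hypothetical, made up for illustration:

```python
import csv
import io
import re
from itertools import chain

pat = re.compile(r'\d{3}-\d{3}-\d{4}')

# Made-up sample data standing in for the scraped strings
info = [
    'Phone: 503-325-9720. Alternate: 123-456-7890.',
    'No number on this line.',
]

# One list of matches per line, then flattened into a single list
nested = [pat.findall(i) for i in info]
phones = list(chain.from_iterable(nested))


def write_phones(rows, fileobj):
    """Write a 'Phone' header followed by one phone number per row."""
    writer = csv.writer(fileobj)
    writer.writerow(['Phone'])
    # Each row must be a sequence, so wrap every number in a list
    writer.writerows([p] for p in rows)


# An in-memory buffer is used here so the sketch has no side effects
buffer = io.StringIO()
write_phones(phones, buffer)
print(buffer.getvalue())
```

To write to an actual file, the same helper could be called as `with open('phones.csv', 'w', newline='') as f: write_phones(phones, f)`.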

edited Apr 23 '20 at 20:16

answered Apr 23 '20 at 19:47

r.ook


So that only works if I run the code with the first index. As soon as I take it out I receive an error: `AttributeError: 'NoneType' object has no attribute 'group'` – Binx Apr 23 '20 at 19:56
Ah, there are some that don't match then. See my updated answer. Also included case for more than one phone number. – r.ook Apr 23 '20 at 20:20
That works. Could you explain what the `r` is within the `re.compile` function? I ended up changing that expression to the same thing I had within my `info` variable because some numbers included `( )`. Also, is `i` the index? – Binx Apr 23 '20 at 20:58
`i` is each of the text strings extracted by your `soup.findAll`. You can name it something more descriptive, of course. The `r` in front of the string quotes denotes a raw string, which eliminates the need to escape each `\`. See more info in the lexical analysis docs here: https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals – r.ook Apr 23 '20 at 21:05
Also, if you need to handle the `( )`'s, make sure they are escaped in the regex, since parentheses have a special meaning there (capture group). E.g. if your first 3 digits might be wrapped in parentheses, use: `r'(\(?\d{3}\)?\-\d{3}\-\d{4})'` – r.ook Apr 23 '20 at 21:07
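
The two points from the comments above (raw strings and escaped parentheses) can be illustrated with a short sketch. The pattern below is an assumption loosely modeled on the one in the question, and the sample strings are made up; it is not the exact expression the asker ended up using:

```python
import re

# In a raw string the backslash is kept as a literal character,
# so r'\d' is the same two-character string as '\\d'.
assert r'\d' == '\\d'

# Assumed pattern: optional escaped parentheses around the area code,
# an optional separator, then 3 digits, a separator, and 4 digits.
pat = re.compile(r'\(?\d{3}\)?[\s.-]?\d{3}[\s.-]\d{4}')

# Made-up sample strings for illustration
samples = [
    'Phone: 503-325-9720.',
    'Call (503) 325-9720 today',
    'No number here',
]

found = []
for s in samples:
    m = pat.search(s)  # search once per string and reuse the result
    if m:              # skip strings without a match (avoids the NoneType error)
        found.append(m.group())

print(found)
```

Storing the match in `m` before testing it avoids running the search twice per string.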
