#!/usr/bin/env python
import re
import requests
from bs4 import BeautifulSoup
import csv
page = requests.get('https://salesweb.civilview.com/Sales/SalesSearch?countyId=32')
soup = BeautifulSoup(page.text, 'html.parser')
list_ = soup.find(class_='table-striped')
list_items = list_.find_all('tr')
content = list_items
d = re.findall(r"<td>\d*/\d*/\d+</td>",str(content))
#d = re.findall(r"<td>\d*/\d*/\d+</td>|<td>\d*?\s.+\d+</td>",str(content))
a = re.findall(r"<td>\d*?\s.+\d+?.*</td>",str(content))
res = d+a
for tup in res:
    tup = re.sub("<td>",'',str(tup))
    tup = re.sub("</td>",'',str(tup))
    print(tup)

When I just print to the screen, I get all the sale dates followed by all the addresses. I have tried several ways to write this to CSV, but I end up with all the data in one column or one row. I would like two columns, sale date and address, with every returned row. This is what I get using print():

8/25/2021
9/1/2021
9/1/2021
9/1/2021
9/1/2021
9/1/2021
9/8/2021
9/8/2021
9/8/2021
9/8/2021
9/15/2021
9/15/2021
9/15/2021
9/15/2021
9/15/2021
9/15/2021
9/22/2021
9/29/2021
9/29/2021
9/29/2021
11/17/2021
4/30/3021
40 PAVILICA ROAD STOCKTON NJ 08559
129 KINGWOOD LOCKTOWN ROAD FRENCHTOWN NJ 08825
63 PHLOX COURT WHITEHOUSE STATION NJ 08889
41 WESTCHESTER TERRACE UNIT 11 CLINTON NJ 08809
461 LITTLE YORK MOUNT PLEASANT ROAD MILFORD NJ 08848
9 MAPLE AVENUE FRENCHTOWN NJ 08825
95 BARTON HOLLOW ROAD FLEMINGTON NJ 08822
27 WORMAN ROAD STOCKTON NJ 08559
30 COLD SPRINGS ROAD CALIFON NJ 07830
211 OLD CROTON ROAD FLEMINGTON NJ 08822
3 BRIAR LANE FLEMINGTON NJ 08822(VACANT)
61 N. FRANKLIN STREET LAMBERTVILLE NJ 08530
802 SPRUCE HILLS DRIVE GLEN GARDNER NJ 08826
2155 STATE ROUTE 31 GLEN GARDNER NJ 08826
80 SCHAAF ROAD BLOOMSBURY NJ 08804
9 CAMBRIDGE DRIVE MILFORD NJ 08848
5 VAN FLEET ROAD NESHANIC STATION NJ 08853
34 WASHINGTON STREET ANNANDALE NJ 08801
229 MILFORD MT PLEASANT ROAD MILFORD NJ 08848
1608 COUNTY ROAD 519 FRENCHTOWN NJ 08825
29 OLD SCHOOLHOUSE ROAD ASBURY NJ 08802
28 ROSE RUN LAMBERTVILLE NJ 08530

Any help would be great. I have been playing with this all day and can't seem to get it right no matter what I try.

3 Answers

My two cents:

#!/usr/bin/env python
import re
import requests
from bs4 import BeautifulSoup
import csv

separator = ','

page  = requests.get('https://salesweb.civilview.com/Sales/SalesSearch?countyId=32')
soup  = BeautifulSoup(page.text, 'html.parser')
list_ = soup.find(class_='table-striped')
list_items = list_.find_all('tr')
content = list_items
d = re.findall(r"<td>\d*/\d*/\d+</td>",str(content))
a = re.findall(r"<td>\d*?\s.+\d+?.*</td>",str(content))

for date, address in zip(d, a):
    print(re.sub("</td>|<td>",'',str(date)),
          separator, 
          re.sub("</td>|<td>",'',str(address)))

Output, date and address are now in one row:

8/25/2021 , 40 PAVILICA ROAD STOCKTON NJ 08559
9/1/2021 , 129 KINGWOOD LOCKTOWN ROAD FRENCHTOWN NJ 08825
9/1/2021 , 63 PHLOX COURT WHITEHOUSE STATION NJ 08889
9/1/2021 , 41 WESTCHESTER TERRACE UNIT 11 CLINTON NJ 08809
9/1/2021 , 461 LITTLE YORK MOUNT PLEASANT ROAD MILFORD NJ 08848
9/1/2021 , 9 MAPLE AVENUE FRENCHTOWN NJ 08825
9/8/2021 , 95 BARTON HOLLOW ROAD FLEMINGTON NJ 08822
9/8/2021 , 27 WORMAN ROAD STOCKTON NJ 08559
9/8/2021 , 30 COLD SPRINGS ROAD CALIFON NJ 07830
9/8/2021 , 211 OLD CROTON ROAD FLEMINGTON NJ 08822
9/15/2021 , 3 BRIAR LANE FLEMINGTON NJ 08822(VACANT)
9/15/2021 , 61 N. FRANKLIN STREET LAMBERTVILLE NJ 08530
9/15/2021 , 802 SPRUCE HILLS DRIVE GLEN GARDNER NJ 08826
9/15/2021 , 2155 STATE ROUTE 31 GLEN GARDNER NJ 08826
9/15/2021 , 80 SCHAAF ROAD BLOOMSBURY NJ 08804
9/15/2021 , 9 CAMBRIDGE DRIVE MILFORD NJ 08848
9/22/2021 , 5 VAN FLEET ROAD NESHANIC STATION NJ 08853
9/29/2021 , 34 WASHINGTON STREET ANNANDALE NJ 08801
9/29/2021 , 229 MILFORD MT PLEASANT ROAD MILFORD NJ 08848
9/29/2021 , 1608 COUNTY ROAD 519 FRENCHTOWN NJ 08825
11/17/2021 , 29 OLD SCHOOLHOUSE ROAD ASBURY NJ 08802
4/30/3021 , 28 ROSE RUN LAMBERTVILLE NJ 08530

As an extra, to export to CSV using pandas:

import pandas as pd

date_list = []
address_list = []

# Strip the <td> tags and collect each column separately
for date, address in zip(d, a):
    date_list.append(re.sub("</td>|<td>",'',str(date)))
    address_list.append(re.sub("</td>|<td>",'',str(address)))

df = pd.DataFrame({'Date': date_list, 'Address': address_list})

# index=False leaves out the row index, so the file has only the two columns
df.to_csv('data.csv', index=False)
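
If pandas feels like more than two columns need, the standard-library csv module can write the same pairs. This is only a minimal sketch: the short `d` and `a` lists stand in for the ones the regular expressions return, and `strip_td` and the file name `sales.csv` are made up for illustration.

```python
import csv
import re

# Stand-ins for the `d` and `a` lists built by re.findall in the question.
d = ["<td>8/25/2021</td>", "<td>9/1/2021</td>"]
a = [
    "<td>40 PAVILICA ROAD STOCKTON NJ 08559</td>",
    "<td>129 KINGWOOD LOCKTOWN ROAD FRENCHTOWN NJ 08825</td>",
]


def strip_td(cell):
    """Remove the opening and closing <td> tags from one matched cell."""
    return re.sub(r"</?td>", "", cell)


# newline="" lets the csv module control line endings itself.
with open("sales.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Date", "Address"])  # header row
    for date, address in zip(d, a):
        # One row per (date, address) pair gives two columns.
        writer.writerow([strip_td(date), strip_td(address)])
```

Passing each pair to `writerow` as a two-item list is what produces two columns; writing the concatenated `d+a` list item by item is what gives a single column.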
Ibrahim Ayoup

It seems to me that, instead of two regular expressions, you would be better off with a single one that uses named groups. I leave that for you to try.
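
For anyone who wants a starting point, here is a rough sketch of the named-groups idea. It assumes each table row sits on one line and that the address is the first cell after the date that starts with a digit. The real page layout may differ, and the sample rows (including the "SOME BANK" cells) are made up for illustration.

```python
import re

# Hypothetical sample rows; the real table may have more cells per row.
sample_html = (
    "<tr><td>8/25/2021</td><td>SOME BANK</td>"
    "<td>40 PAVILICA ROAD STOCKTON NJ 08559</td></tr>\n"
    "<tr><td>9/1/2021</td><td>OTHER BANK</td>"
    "<td>129 KINGWOOD LOCKTOWN ROAD FRENCHTOWN NJ 08825</td></tr>\n"
)

ROW_PATTERN = re.compile(
    r"<td>(?P<date>\d{1,2}/\d{1,2}/\d{4})</td>"  # the sale date cell
    r".*?"                                       # any cells in between, same line
    r"<td>(?P<address>\d[^<]*)</td>"             # next cell that starts with a digit
)

# Each match carries both values, so date and address cannot drift apart.
rows = [
    (m.group("date"), m.group("address"))
    for m in ROW_PATTERN.finditer(sample_html)
]

for date, address in rows:
    print(date, address)
```

Because one match yields both fields, there is no second list to keep in step with the first.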

Given that you have two corresponding lists of values, the simplest fix is to stop concatenating them:

res = d+a

and instead iterate over them in pairs:

for tup, tup2 in zip(d, a):
    tup = re.sub("<td>",'',str(tup))
    tup = re.sub("</td>",'',str(tup))

    tup2 = re.sub("<td>",'',str(tup2))
    tup2 = re.sub("</td>",'',str(tup2))

    print(tup, tup2)
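
One caveat with zip: it stops at the shorter list, so if the address pattern ever matches an extra cell, the pairs shift without any error. A small sketch of a guard, where `pair_columns` is a hypothetical helper name and the sample values are taken from the output in the question:

```python
def pair_columns(dates, addresses):
    """Pair dates with addresses, refusing to guess if the counts differ."""
    if len(dates) != len(addresses):
        raise ValueError(
            f"{len(dates)} dates but {len(addresses)} addresses"
        )
    return list(zip(dates, addresses))


pairs = pair_columns(
    ["8/25/2021", "9/1/2021"],
    [
        "40 PAVILICA ROAD STOCKTON NJ 08559",
        "129 KINGWOOD LOCKTOWN ROAD FRENCHTOWN NJ 08825",
    ],
)

for date, address in pairs:
    print(date, address)
```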
sophros
#!/usr/bin/env python
import re
import requests
from bs4 import BeautifulSoup
import pandas as pd
page = requests.get('https://salesweb.civilview.com/Sales/SalesSearch?countyId=32')
soup = BeautifulSoup(page.text, 'html.parser')
list_ = soup.find(class_='table-striped')
list_items = list_.find_all('tr')
content = list_items
d = re.findall(r"<td>\d*/\d*/\d+</td>",str(content))    #this is a list
#d = re.findall(r"<td>\d*/\d*/\d+</td>|<td>\d*?\s.+\d+</td>",str(content))
a = re.findall(r"<td>\d*?\s.+\d+?.*</td>",str(content))  #this is a list

## create a dataframe with two lists and remove tags

df = pd.DataFrame(list(zip(d,a)), columns=['sales_date','address'])

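# Strip the <td> and </td> tags; lstrip/rstrip remove character sets, not prefixes
for cols in df.columns:
    df[cols] = df[cols].str.replace(r"</?td>", "", regex=True)

df.to_csv("result.csv", index=False)  # index=False keeps just the two columns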
Simanti
  • This is also a working version that gives the same results in the end. Pandas was one of the modules I was trying to work with, and I had started to hate it. Thanks for the support. – Steven Greer Aug 18 '21 at 12:51