I am using OutWit Hub to scrape a website for city, state, and country (USA and Canada only). The program lets me use regular expressions to define the markers Before and After the text I wish to grab. I can also define a format for the desired text.
Here is a sample of the HTML:
<td width="8%" nowrap="nowrap"></td>
<td width="22%" nowrap="nowrap"><strong>
BILLINGS, MT
USA</strong></td>
<td width="10%" align="right" nowrap="nowrap">
I have set up my regular expressions as follows:
CITY - Before (not formatted as regex)
<td width="22%" nowrap="nowrap"><strong>
CITY - After (accounts for states, territories, and provinces)
/(,\s|\bA[BLKSZRAEP]\b|\bBC\b|\bC[AOT]\b|\bD[EC]\b|\bF[LM]\b|\bG[AU]\b|\bHI\b|\bI[ADLN]\b|\bK[SY]\b|\bLA\b|\bM[ABDEHINOPST]\b|\bN[BLTSUCDEHJMVY]\b|\bO[HKNR]\b|\bP[AERW]\b|\bQC\b|\bRI\b|\bS[CDK]\b|\bT[NX]\b|\bUT\b|\bV[AIT]\b|\bW[AIVY]\b|\bYT\b|\bUSA|\bCanada)/
STATE - Before
\<td width="22%" nowrap="nowrap"\>\<strong\>\s|,\s
STATE - After
/\bUSA\<\/strong\>\<\/td\>|\bCanada\<\/strong\>\<\/td\>/
STATE - Format
/\b[A-Z][A-Z]\b/
COUNTRY - Before (accounts for states, territories, and provinces)
/(\bA[BLKSZRAEP]\b|\bBC\b|\bC[AOT]\b|\bD[EC]\b|\bF[LM]\b|\bG[AU]\b|\bHI\b|\bI[ADLN]\b|\bK[SY]\b|\bLA\b|\bM[ABDEHINOPST]\b|\bN[BLTSUCDEHJMVY]\b|\bO[HKNR]\b|\bP[AERW]\b|\bQC\b|\bRI\b|\bS[CDK]\b|\bT[NX]\b|\bUT\b|\bV[AIT]\b|\bW[AIVY]\b|\bYT\b)\s/
COUNTRY - After (not formatted as regex)
</strong></td><td width="10%" align="right" nowrap="nowrap">
The issue arises when there is no city or state listed. I have tried to account for this, but I am just making it worse. Is there any way this can be cleaned up and still account for the possibility of missing info?
Example with no city:
<td width="8%" nowrap="nowrap"></td>
<td width="22%" nowrap="nowrap"><strong>
MT
USA</strong></td>
<td width="10%" align="right" nowrap="nowrap">
Example with no city / state: (yes, there is an extra line break)
<td width="8%" nowrap="nowrap"></td>
<td width="22%" nowrap="nowrap"><strong>
USA</strong></td>
<td width="10%" align="right" nowrap="nowrap">
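To make the goal concrete, here is a minimal sketch of the result I am after. It is written in Python rather than OutWit Hub's Before/After marker syntax, so it is only an illustration and not something I can paste into the scraper. The single pattern, the named groups, and the helper name parse_location are my own assumptions. It also assumes the country is always present and is either USA or Canada, and it treats any two capital letters before the country as the state or province instead of listing every code.

```python
import re

# Assumed structure: optional "CITY, ", optional two-letter state/province,
# then a mandatory country, all inside the 22%-wide <strong> cell.
LOCATION = re.compile(
    r'<td width="22%" nowrap="nowrap"><strong>\s*'
    r'(?:(?P<city>[^,<\n]+?),\s*)?'   # city: text up to a comma (optional)
    r'(?:(?P<state>[A-Z]{2})\s+)?'    # state/province code (optional)
    r'(?P<country>USA|Canada)\s*</strong></td>'
)


def parse_location(html):
    """Return (city, state, country); parts missing from the cell are None."""
    match = LOCATION.search(html)
    if match is None:
        return None
    return match.group("city"), match.group("state"), match.group("country")
```

With city and state wrapped in optional groups, a missing piece comes back as None instead of shifting the remaining text into the wrong field, which is the behaviour I have not managed to reproduce with separate markers.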
Thank you for any help you can provide.