
I'm trying to write a regex to scrape text for a jobs site I'm building. I'm relatively new to scraping (and coding) and have been using Parsehub to help with the former. That works well when a job field consistently maps to the same HTML element (e.g. job_title always sits in the same element, in the same position, on a page). I can use Parsehub to scrape a relevant block of text, but I'll need a regex to give Parsehub more direction when the info I need can only be identified by its relation to the surrounding text.

I've spent hours trying to figure out the following. For example, I want to extract the deadline date from the following text:

To Apply

Deadline for applications is the 10th January 2021. Interviews will take place in the third week of January 2021.

I've written the regex:

/Deadline for applications is the\s([0-9a-zA-Z]\w*)\s([0-9a-zA-Z]\w*)\s([0-9a-zA-Z]\w*)

But how do I pull just the groups 1-3? If I add \1 or $1 for example, at the end, I get an error "regular expression does not match the subject string."

I have some way to go in learning here but if anyone has some pointers, they'd be much appreciated. Once I get the basic principles of the above, I'll be in a much better place.

Wiktor Stribiżew
intdev

2 Answers


Check out this link: https://javascript.info/regexp-groups#parentheses-contents-in-the-match

In order to get just the groups 1-3, you only have to omit group 0 as group 0 always represents the full match.

const text = "Deadline for applications is the 10th January 2021. Interviews will take place in the third week of January 2021.";

const matches = text.match(/Deadline for applications is the\s([0-9a-zA-Z]\w*)\s([0-9a-zA-Z]\w*)\s([0-9a-zA-Z]\w*)/);

// match() returns null when nothing matches, so check before using the result
if (matches) {
  // Drop the first element, which always holds the full match
  matches.shift();
}
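
As for why adding \1 or $1 to the end of the pattern fails: inside a pattern, \1 is a backreference (it requires the text captured by group 1 to appear again at that point), and $ is the end-of-input anchor, so neither one selects a group. The sketch below shows both behaviours in plain JavaScript. Whether Parsehub offers a replacement field is not something I can confirm, so treat the replace() call as an illustration of the $1 syntax rather than a Parsehub recipe.

```javascript
const text = "Deadline for applications is the 10th January 2021. Interviews will take place in the third week of January 2021.";

// \1 at the end demands that "10th" (group 1) appears again right after "2021".
// It does not, so match() returns null.
const withBackreference = /Deadline for applications is the\s([0-9a-zA-Z]\w*)\s([0-9a-zA-Z]\w*)\s([0-9a-zA-Z]\w*)\1/;
const failed = text.match(withBackreference);

// $1, $2 and $3 belong in a replacement string, not in the pattern.
// The leading ^.*? and trailing .*$ consume the rest of the text so that
// only the three groups survive the replacement.
const wholeText = /^.*?Deadline for applications is the\s([0-9a-zA-Z]\w*)\s([0-9a-zA-Z]\w*)\s([0-9a-zA-Z]\w*).*$/s;
const deadline = text.replace(wholeText, "$1 $2 $3");

console.log(failed);   // null
console.log(deadline); // 10th January 2021
```

Note that replace() returns the input unchanged when the pattern does not match, so check for a match first if the text may not contain a deadline.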

If you are looking for a more precise solution, can you please post more details about the code you are running?

  • Thanks. I'm using Parsehub for the scraping, which is doing some of the heavy lifting on the coding. Because there's no distinct element to link 'Deadline for applications' to, I'm extracting the text en masse and using a regex to pick out the info I need, as per the screenshot: http://intdevjobs.com/wp-content/uploads/2020/12/Screenshot-2020-12-24-at-11.48.29.png A further challenge is that not every job ad, even on this site, will use the same text. Others may have 'Closing Date: xxx.' But, if I'm right, a regex, or multiple regexes, is the way to go. – intdev Dec 24 '20 at 12:00
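
One way to handle several wordings is to keep a list of patterns and return the first one that matches. This is only a sketch: the function name extractDeadline, the deadlinePatterns list, and the exact 'Closing Date:' format are assumptions for illustration, and it uses named groups (day, month, year) so every pattern yields the same fields.

```javascript
// Hypothetical list of phrasings; add a pattern for each new wording you meet.
const deadlinePatterns = [
  /Deadline for applications is the\s+(?<day>\d{1,2})\w*\s+(?<month>[A-Za-z]+)\s+(?<year>\d{4})/,
  /Closing Date:\s*(?<day>\d{1,2})\w*\s+(?<month>[A-Za-z]+)\s+(?<year>\d{4})/,
];

// Returns { day, month, year } from the first pattern that matches, or null.
function extractDeadline(text) {
  for (const pattern of deadlinePatterns) {
    const match = text.match(pattern);
    if (match) {
      const { day, month, year } = match.groups;
      return { day, month, year };
    }
  }
  return null;
}

console.log(extractDeadline("Deadline for applications is the 10th January 2021."));
console.log(extractDeadline("Closing Date: 5 February 2021."));
console.log(extractDeadline("No date in this text."));
```

Returning null for unmatched text makes it easy to spot ads whose wording is not covered yet.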

You can use named groups. I think this is a more readable solution.

const str = "Deadline for applications is the 10th January 2021. Interviews will take place in the third week of January 2021.";
// \d{1,2} captures the day number; the \w* after it skips the "th"/"st" suffix
const pattern = /Deadline for applications is the\s(?<day>\d{1,2})\w*\s(?<month>\w+)\s(?<year>\d{4})/;

// match() returns null if the text does not contain the pattern
const match = str.match(pattern);

if (match) {
  const { day, month, year } = match.groups;
  console.log(day, month, year);
}
Robert