
I'm trying to extract a section of text from a parsed HTML page. The section starts after a pattern ("Item c") that occurs multiple times in the page (there are three "Item c" headings).

When I run my code, it only picks up the last occurrence, but I need just the first one.

Here's the HTML structure of the first occurrence and some code I've used to find the beginning and end of the text:

<p>
   <font style="display:inline;">Item c.&nbsp;&nbsp;Mike’s bike</font>
</p>...

a <- grep("^Item\\s{0,}c.\\s{0,}M", f.text, ignore.case = TRUE)
b <- grep("^Item\\s{0,}d.\\s{0,}Q", f.text, ignore.case = TRUE)
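
For reference, a minimal sketch of how two index vectors like `a` and `b` could be used to slice out only the first section. It assumes `f.text` is a character vector with one element per line of text; the sample data is hypothetical, and the patterns are loosened (escaped dot, no trailing letter) purely for illustration:

```r
# Hypothetical sample: one element per line of already-extracted page text.
f.text <- c(
  "Intro",
  "Item c. Mike's bike",
  "A red bike.",
  "Item d. Quiet street",
  "Filler",
  "Item c. Jane's scooter",
  "A blue scooter.",
  "Item d. Quick exit"
)

# grep() returns the indices of ALL matching elements, in document order.
a <- grep("^Item\\s*c\\.", f.text, ignore.case = TRUE)
b <- grep("^Item\\s*d\\.", f.text, ignore.case = TRUE)

start <- a[1]             # first "Item c" heading
end   <- b[b > start][1]  # first "Item d" heading that follows it

# Lines strictly between the two headings.
section <- f.text[(start + 1):(end - 1)]
print(section)
```

Because `grep()` returns every match, indexing with `[1]` is what pins the result to the first occurrence. The slice assumes at least one line sits between the two headings.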

I tried matching part of the words exactly, but it doesn't always work.

Is there an indexing/more general matching tip I can use?

Thank you in advance

Disclaimer: fairly new to R :)

giu
  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. What exactly does "doesn't always work" mean in this case? – MrFlick Apr 29 '19 at 21:06
  • It's difficult to provide good advice without a reproducible example, but you might be better off looking into `str_extract_all()` which is in the `stringr` package, also in `tidyverse`. E.g. `str_extract_all(f.text, "Item\\s{0,}c.\\s{0,}.*?(?=<)")` will return all matches, e.g. `"Item c.  Mike’s bike"`, `"Item c.  Jane’s scooter"`. You can then trim these strings as necessary. (The above code uses a _lazy quantifier_ and _positive lookahead_ - you might want to look into these.) – Stuart Allen Apr 30 '19 at 00:58
  • @StuartAllen and @MrFlick thank you so much. Sorry, not a good reproducible example indeed, but your critiques are really helping me understand. In particular, I think the problem is with the HTML entity for the non-breaking space: in some files there is simply a plain space, in others there is "&nbsp;" (and that's why the regex sometimes works and sometimes doesn't). Now the question is, how do I match both "Item c.&nbsp;&nbsp;Jane’s" and "Item c. Jane’s"? Your help is super appreciated – giu Apr 30 '19 at 10:57
  • @giu Re the nbsp issue, if I were doing this I'd probably, as a first step, just replace all `&nbsp;` with a space: `str_replace_all(f.text, "&nbsp;", " ")`. – Stuart Allen Apr 30 '19 at 11:59
  • @StuartAllen thanks a lot for your clarification. In particular, in your first example, how do I match only "Item c. Mike's bike" and not "Item c. Jane's scooter"? My code seems to be grabbing only the last occurrence. Thank you – giu Apr 30 '19 at 23:05
  • @giu You'll really need to post your code and some test data at the very least, otherwise it's very difficult to say why your code is returning a given output. – Stuart Allen May 01 '19 at 00:31
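
The suggestions in the comments above can be combined into a rough sketch: normalise the `&nbsp;` entity first, then take only the first match. The `html` string here is hypothetical test data, and the pattern is a slightly tightened variant (escaped dot) of the one proposed above, so treat it as an illustration under those assumptions rather than a tested fix for the real files:

```r
library(stringr)

# Hypothetical page source, held as a single string.
html <- paste0(
  "<p><font>Item c.&nbsp;&nbsp;Mike's bike</font></p>",
  "<p><font>Item c. Jane's scooter</font></p>"
)

# Step 1: replace the literal HTML entity with an ordinary space.
clean <- str_replace_all(html, fixed("&nbsp;"), " ")

# Lazy quantifier (.*?) plus positive lookahead ((?=<)) stops at the next tag.
pattern <- "Item\\s*c\\.\\s*.*?(?=<)"

# Step 2: str_extract() returns only the FIRST match in each string,
# whereas str_extract_all() returns every match.
first_match <- str_extract(clean, pattern)
all_matches <- str_extract_all(clean, pattern)[[1]]

print(first_match)
print(all_matches)
```

With this sample input, `first_match` holds only the "Mike's bike" heading, while `all_matches` holds both headings; `all_matches[1]` is another way to reach the first one.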

0 Answers