I have many text files. In each text file, there is a section of interest (below):

    <tr>
        <td ><b>发起时间</b></td>
        <td colspan="2" style="text-align: left">2015-04-08</td>
        <td style="width: 25%;"><b>回报机制</b></td>
        <td colspan="2" style="text-align: left">使用者付费</td>
    </tr>

The information that varies across files is the date only. In this case, the date is 2015-04-08.

I want to extract the date. I am an R user, and I normally would use str_match from the stringr package. I would indicate the following as the start of the string:

        <td ><b>发起时间</b></td>
        <td colspan="2" style="text-align: left">

However, I am not sure what to do given that this string is spread over two lines. What can I do? (It also contains Chinese characters, but that's a separate issue)

wwl
  • I would suggest trying a regular expression if your date format changes. You can browse through this link for starters: https://www.regular-expressions.info/rlanguage.html – Gaurav Taneja Sep 27 '17 at 00:04
  • If you're parsing HTML, I would recommend using `rvest` to extract the text between the table tags. Then you don't need to worry about the additional HTML. – Jake Kaupp Sep 27 '17 at 00:20

1 Answer

Doing it with Regex

It's generally not advisable to parse HTML with a regex because of the obscure edge cases that can crop up, but it seems that you have some control over the HTML, so you should be able to avoid many of the edge cases the regex police cry about.

Proposed solution with Regex

You can use \s+ where the carriage return and newline would be. The resulting regex would look like this:

    <td ><b>发起时间<\/b><\/td>\s+<td colspan="2" style="text-align: left">([0-9]{4}-[0-9]{2}-[0-9]{2})<\/td>

Based on your sample text, the first capture group would then contain the string of characters that resembles the date. Note that the regex does not actually validate the date; it only matches the format.
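As a rough sketch of how this could look with str_match from stringr (the package mentioned in the question): the sample text below is a shortened copy of the snippet from the question, and the variable names `html`, `pattern`, and `date_found` are made up for illustration. It assumes the script and the input are UTF-8 encoded so that the Chinese characters compare correctly. The slashes are left unescaped here because stringr's regex engine does not need them escaped.

```r
library(stringr)

# Shortened sample mirroring the question's snippet, with CR+LF line breaks
html <- paste(
  "    <tr>",
  "        <td ><b>发起时间</b></td>",
  "        <td colspan=\"2\" style=\"text-align: left\">2015-04-08</td>",
  "    </tr>",
  sep = "\r\n"
)

# \\s+ spans the line break and the indentation between the two <td> cells
pattern <- paste0(
  "<td ><b>发起时间</b></td>\\s+",
  "<td colspan=\"2\" style=\"text-align: left\">",
  "([0-9]{4}-[0-9]{2}-[0-9]{2})</td>"
)

m <- str_match(html, pattern)

# Column 1 holds the full match, column 2 the first capture group
date_found <- m[1, 2]
print(date_found)
```

If the pattern does not match, str_match returns NA in every column, so it is worth checking for NA before using the result.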

Explained

The \s+ regex will do the following:

\s matches any white space character
+ allows the preceding regex to match 1 or more times

Since we know there will be a carriage return, a newline, and what appears to be a tab or multiple spaces, \s+ will match all of them. However, if these whitespace characters are optional in your source files, you could use \s* instead, where the * matches zero or more whitespace characters.
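To illustrate the difference between the two quantifiers, here is a small sketch using str_detect from stringr. The two test strings and all variable names are invented for the illustration and are not taken from the original files.

```r
library(stringr)

cells_multiline <- "<td>a</td>\r\n\t<td>b</td>"  # cells separated by CR, LF and a tab
cells_adjacent  <- "<td>a</td><td>b</td>"        # no whitespace between the cells

plus_pattern <- "</td>\\s+<td>"  # requires at least one whitespace character
star_pattern <- "</td>\\s*<td>"  # whitespace between the cells is optional

inputs <- c(cells_multiline, cells_adjacent)

plus_hits <- str_detect(inputs, plus_pattern)  # TRUE only where whitespace is present
star_hits <- str_detect(inputs, star_pattern)  # TRUE for both inputs

print(plus_hits)
print(star_hits)
```

With \s+ only the multi-line input matches, while \s* accepts both, which is the safer choice if some files have the two cells on one line.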

Example

Please see this live example

Ro Yo Mi