1

UPDATE: Regarding my 2nd question (how to convert string to date format in MySQL), I found a way and want to share it:

1) Save the "string date" data as VARCHAR (Don't use TEXT)

2) When showing MySQL data in PHP or other ways, use the function of str_to_date(string-date-column, date-format), such as the following example:

$sql = "SELECT * FROM yourtablename ORDER BY str_to_date(string-date-column, '%d %M %Y')";

I am using scrapy to collect data, write to database. From a website, the post date of each item is listed as following:

<p>   #This is the last <p> within each <div>
<br>
[15 May 2015, #9789]
<br>
</p>

So the date is always behind a "[" and before a ",". I used the following xpath code to extract:

sel.xpath("p[last()]/text()[contains(., '[')]").extract()

But I will get the whole line:

[15 May 2015, #9789]

So, how to get only the part of "15 May 2015"? If this can be done, how to convert the scraped string (15 May 2015) as real DATE data, so it can be used for sorting? Thanks a bunch!

LearnAWK
  • 549
  • 6
  • 17

2 Answers2

2

Regarding the first question, assuming that there is maximum one date at a time, you can use combination of XPath substring-after() and substring-before() functions to get 15 May 2015 part of the text node :

substring-before(substring-after(p[last()]/text()[contains(., '[')], '['), ',')

Regarding the second question, you can use datetime.strptime() to convert string to datetime :

import datetime

result = datetime.datetime.strptime("15 May 2015", "%d %b %Y")
print(result)
print(type(result))

output :

2015-05-15 00:00:00
<type 'datetime.datetime'>
har07
  • 88,338
  • 12
  • 84
  • 137
  • 1
    The second part of the question is [answered here](http://stackoverflow.com/q/1713594/190597). – unutbu Jun 28 '15 at 01:06
  • While trying to use the datetime function, I got the following error message: "exceptions.ValueError: time data "[u'15 May 2015']" does not match format '%d-%b-%Y'". Is it because Scrapy add special characters before and after the scraped string of "15 May 2015"? – LearnAWK Jun 28 '15 at 02:18
  • Related question: If I will put my data into MySQL anyway, will strings like "15 May 2015" be recognized as dates as long as I define that column as date type of data in MySQL? – LearnAWK Jun 28 '15 at 02:20
  • That's because the correct format, as demonstrated above, is "%d %b %Y" with no dash involved – har07 Jun 28 '15 at 02:30
  • Well, after I change it to "%d %b %Y", I still get the same error message: " exceptions.ValueError: time data "[u'20 May 2015']" does not match format '%d %b %Y' " – LearnAWK Jun 28 '15 at 03:29
1

A more "scrapic" approach would involve using the built-in regular expression support in the XPath expressions and/or .re().

This is with both applied:

In [1]: response.xpath("p[last()]/text()[re:test(., '\[\d+ \w+ \d{4}\, #\d+\]')]").re(r"\d+ \w+ \d{4}")
Out[1]: [u'15 May 2015']

Or, this is when you use .re() to extract the date locating the element as you did before:

In [2]: response.xpath("p[last()]/text()[contains(., '[')]").re(r"\d+ \w+ \d{4}")
Out[2]: [u'15 May 2015']
alecxe
  • 462,703
  • 120
  • 1,088
  • 1,195
  • Your "scrapic" approach works perfectly, alecxe! Do you have suggestions regarding use the "[u'15 May 2015']" output as date instead of string while considering scrapy and mysql usage in the whole picture? Will MySQL automatically recognized "[u'15 May 2015']" as date if I define that column as "date" type? – LearnAWK Jun 28 '15 at 02:13
  • @LearnAWK I would recommend being more explicit and operate with actual dates - it would also catch the errors earlier providing an additional validation. – alecxe Jun 28 '15 at 02:24