
I've been using

pd.read_csv('file.csv', parse_dates=['date_time'])

to parse dates, and then I build a DatetimeIndex to read year, month, and day off the date_time variable. When parsing works, date_time should come out as datetime64. But something in that column means I keep getting 'object' as the dtype instead, so I get a ValueError when I try to build the DatetimeIndex. My data is too big for me to find by eye what exactly went wrong. How should I handle this so I can change the anomalies to missing and get the date_time variable parsed? Thanks.
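One way to do exactly that is pd.to_datetime with errors='coerce', which turns unparseable values into NaT (missing) instead of raising. A minimal sketch, assuming the column is named date_time as above:

import pandas as pd

df = pd.read_csv('file.csv')

# errors='coerce' turns any value that cannot be parsed into NaT
# instead of raising, so the column still ends up as datetime64.
df['date_time'] = pd.to_datetime(df['date_time'], errors='coerce')

# The rows where parsing failed are now easy to inspect or drop.
print(df[df['date_time'].isna()])

# With a clean datetime64 column, year/month/day come straight off
# the .dt accessor (or off a DatetimeIndex, as in the question).
df['year'] = df['date_time'].dt.year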

Update:

I did what EdChum suggested, except pretty manually. Here is my guess as to why the data is out of sorts: one date was supposed to be 2016, but it shows 2161 instead. Does anyone know why Python wouldn't parse the datetime in this case? And how can I identify all rows like this and delete them?
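Worth noting: 2161 is actually within pandas' Timestamp bounds (which run to the year 2262), so pd.to_datetime parses it without complaint; it only shows up once you sanity-check the parsed years. A sketch for flagging and dropping rows whose dates are missing or implausible (the 2000-2017 window is an assumption; adjust it to your data):

import pandas as pd

df = pd.read_csv('file.csv')

# Coerce instead of raising, so malformed strings become NaT.
parsed = pd.to_datetime(df['date_time'], errors='coerce')

# Flag rows that failed to parse at all, plus rows whose year falls
# outside a plausible window for this dataset.
bad = parsed.isna() | ~parsed.dt.year.between(2000, 2017)
print(df[bad])

# Drop them (or set them to missing instead, as discussed below).
df = df[~bad].copy()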

CWeeks
  • Add sample data, in case there is something unique about it. Check syntax for parse_dates. – Merlin Jun 07 '16 at 20:03
  • I tried the method suggested in the thread but got the error below: – CWeeks Jun 07 '16 at 20:04
  • OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1-255-255 00:00:00 – CWeeks Jun 07 '16 at 20:04
  • You can use the binary search approach to find the problematic record (take the 1st/2nd half, see which has problem, then split further). – ivan_pozdeev Jun 07 '16 at 20:04
  • I am quite new to Python; what is the binary method? – CWeeks Jun 07 '16 at 20:06
  • It doesn't give me an error when I parse the variable; it only does when I try to DatetimeIndex it. – CWeeks Jun 07 '16 at 20:07
  • Then edit the question to reflect what exactly you do. – ivan_pozdeev Jun 07 '16 at 20:08
  • You can parse the csv in chunks and check when it barfs: if you pass `chunksize=n_rows`, where `n_rows` is some arbitrary value like 1000, you can then test for each chunk whether the conversion succeeds and, if not, print the line that's duff (sketched in code after these comments). The other issue here is: do you really want duff data? You could just drop those rows. – EdChum Jun 07 '16 at 20:09
  • @ivan_pozdeev, is the binary search approach that you mentioned this: – CWeeks Jun 07 '16 at 20:17
  • final['date_time'] = [dateutil.parser.parse(x) for x in final['date_time']] – CWeeks Jun 07 '16 at 20:17
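Putting EdChum's chunked check into code (a sketch; the chunk size of 1000 and the column name date_time are assumptions): reading in chunks narrows the failure down to a block of rows, in the same spirit as the binary search ivan_pozdeev describes.

import pandas as pd

# Read the file in chunks and try the conversion on each one;
# any chunk that raises contains the offending rows.
for i, chunk in enumerate(pd.read_csv('file.csv', chunksize=1000)):
    try:
        pd.to_datetime(chunk['date_time'])
    except Exception as e:
        print(f'bad chunk {i} (rows {i * 1000}-{(i + 1) * 1000 - 1}): {e}')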

2 Answers


Try this:

import pandas as pd

# parse_dates (rather than date_parser, which expects a callable)
# converts the column while the file is being read.
df = pd.read_csv('test.csv.gz', compression='infer',
                 parse_dates=['date_time'], usecols=[0, 1, 3])
print(df.head())

#       id            date_time  posa_continent
#    0   0  2015-09-03 17:09:54               3
#    1   1  2015-09-24 17:38:35               3
#    2   2  2015-06-07 15:53:02               3
#    3   3  2015-09-14 14:49:10               3
#    4   4  2015-07-17 09:32:04               3
Merlin
  • This is a neat way to load data directly from gz files. It still doesn't quite solve my problem. Apologies, I cited date_time in the question (to stay general), but it is actually the srch_ci column (search check-in date) that had the issue. I played around with the data more and confirmed that it is one particular row (row id=312920) causing all the trouble. So I deleted it and everything went fine. But thanks for the help!!! – CWeeks Jun 08 '16 at 00:31
  • @CWlearner, if it is one row, edit that row and add it back in. – Merlin Jun 08 '16 at 00:54
  • Yeah, I think you are right. I am going to change its value to missing so it can be treated later on. – CWeeks Jun 08 '16 at 01:10
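For the record, a minimal sketch of that fix (the column name srch_ci and row id 312920 come from the comments above; using np.nan as the placeholder is an assumption):

import numpy as np
import pandas as pd

df = pd.read_csv('test.csv.gz', compression='infer')

# Blank out the one bad check-in date, then parse; pd.to_datetime
# treats the missing value as NaT rather than raising.
df.loc[df['id'] == 312920, 'srch_ci'] = np.nan
df['srch_ci'] = pd.to_datetime(df['srch_ci'], errors='coerce')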

This will help you diagnose the problem. Please run this snippet and post the output of bad_rows:

import dateutil.parser
import pandas as pd

df = pd.read_csv('file.csv')
bad_rows = []
good_rows = []
for row, date in enumerate(df['date_time']):
    try:
        # dateutil is more forgiving than pandas, so anything that
        # still fails here is genuinely malformed.
        good_rows.append((row, dateutil.parser.parse(date)))
    except Exception as e:
        print(str(e))
        bad_rows.append((row, date))
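Once bad_rows is populated, one way to proceed (a sketch, not part of the original answer) is to drop those rows by index:

# Assuming df still has its default RangeIndex, the positional row
# numbers collected above double as index labels.
df = df.drop(index=[row for row, _ in bad_rows])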
michael_j_ward
  • I ran the code, but bad_rows seems to capture all the rows while good_rows captures none. – CWeeks Jun 07 '16 at 20:53
  • Edited. Please give at least a few examples of the `date` that could not be parsed and the corresponding `Exception`. – michael_j_ward Jun 07 '16 at 20:57
  • @michael_j_ward, thanks for following up. I ran it again but got an error again. It says "ValueError: I/O operation on closed file". – CWeeks Jun 07 '16 at 22:53
  • @Merlin it is the test file from the Kaggle Expedia contest. Here is the link to the data: https://www.kaggle.com/c/expedia-hotel-recommendations/data – CWeeks Jun 07 '16 at 22:55
  • I will also try to add a few lines to show as an example. – CWeeks Jun 07 '16 at 22:56