0

My source is a text file of the following form:

cpu95-20000117-04004,134.perl,42.6,44.4
cpu95-20000117-04004,147.vortex,44.7,44.7

I would like to parse the date with Python into a form that can be plotted with matplotlib.pyplot (i.e. no strings or Timestamp objects). I will plot the last item (e.g. 44.4) against the dates (e.g. 2000/01/17). I will also use this data later as input for a scikit-learn linear regression model, so I believe the dates should be int or float. Thanks very much.

PS: I checked similar questions, but the trend is to use either the .date() method, pandas' pd.to_datetime and its variations, or other methods that produce objects that don't fit into a scikit-learn model or matplotlib.

EDIT: I should be clearer. I would like to plot the real dates (so no toordinal), and therefore cannot use the datetime option (it wouldn't work for pyplot and scikit-learn when I tried to turn datetime into int). I therefore probably need a way to treat something like 2000/01/17 or 2000.01.17 as an integer.

oba2311
  • Have you looked [here](https://stackoverflow.com/questions/1574088/plotting-time-in-python-with-matplotlib)? Why would you fit a model with a date like that? The common practice is to use indices. Assume that `2000:01:17` is the initial period point that equals 1. Then, the next period would be equal to 2, and so on. There is no way you can treat `2000/01/17` or `2000.01.17` as `int` object. – E.Z Sep 09 '17 at 06:07

5 Answers

1

Assuming that you can use an integer representation of the dates and a float value for the last item in each line as inputs to scikit-learn, this should do what you want.

toordinal returns the proleptic Gregorian ordinal of the date. This means that January 1 of year 1 is represented by 1, January 2 by 2, and so on, which works fine for ordinary regression.
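As a rough illustration of the ordinal representation described above, the following sketch uses only the standard library. The variable names are made up for this example. It shows that consecutive days differ by exactly 1 and that date.fromordinal turns an ordinal back into a real date, which may help if the integers are used for the regression but the real dates are wanted for tick labels.

```python
from datetime import date

# January 1 of year 1 is ordinal 1 in the proleptic Gregorian calendar.
first_day = date(1, 1, 1).toordinal()

# The sample date from the question and the day after it.
sample_ordinal = date(2000, 1, 17).toordinal()
next_day_ordinal = date(2000, 1, 18).toordinal()

# Consecutive days are exactly 1 apart, so the values are evenly spaced.
step = next_day_ordinal - sample_ordinal

# fromordinal is the inverse: it recovers the calendar date from the integer.
recovered = date.fromordinal(sample_ordinal)

print(first_day, sample_ordinal, step, recovered.isoformat())
```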

re.search winkles out the two pieces you need from the input lines for further processing.

Three lists are compiled as the for-loop progresses. Y eventually contains the final items in the input lines, dates_for_plotting the dates as needed by matplotlib and dates_for_regression the integer values as needed for your regression.

The last part of the script shows how to use the dates gathered from the input to create a plot.

>>> txt = '''\
... cpu95-20000117-04004,134.perl,42.6,44.4
... cpu95-20000117-04004,147.vortex,44.7,44.7
... '''
>>> import re
>>> from datetime import datetime
>>> Y = []
>>> dates_for_plotting = []
>>> dates_for_regression = []
>>> for line in txt.split('\n'):
...     if line:
...         r = re.search(r'-([^-]+)-(?:[^,]+,){3}([0-9.]+)', line).groups()
...         the_date = datetime.strptime(r[0], '%Y%m%d')
...         dates_for_plotting.append(the_date.date())
...         dates_for_regression.append(the_date.toordinal())
...         Y.append(float(r[1]))  # convert to float so the values plot numerically
...
>>> import matplotlib.pyplot as plt
>>> import matplotlib.dates as mdates
>>> plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
>>> plt.gca().xaxis.set_major_locator(mdates.DayLocator())
>>> plt.plot(dates_for_plotting, Y)
>>> plt.gcf().autofmt_xdate()
>>> plt.show()
Bill Bell
0

For this you probably have to write your own small parser.

You can use regular expressions, or use line.split(',') on every line in the file.
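A minimal sketch of the split-based approach might look like the following. The function name parse_line and the sample string are illustrative only, and the sketch assumes every line has the same layout as the two shown in the question.

```python
from datetime import datetime

def parse_line(line):
    """Return (date, last_value) for one comma-separated record (hypothetical helper)."""
    fields = line.strip().split(",")
    # The first field looks like 'cpu95-20000117-04004'; the date is the middle part.
    date_text = fields[0].split("-")[1]
    parsed = datetime.strptime(date_text, "%Y%m%d").date()
    # The last field is the value to plot.
    return parsed, float(fields[-1])

sample = "cpu95-20000117-04004,134.perl,42.6,44.4"
parsed_date, last_value = parse_line(sample)
print(parsed_date, last_value)
```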

philippd
0

Wrap the date string in int().

Example:

myString = "20000117"
try:
    myVar = int(myString)
except ValueError:
    pass # or take some action here

Python parse int from string

Wrap it in a try block to be safe.
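One thing that may be worth checking before feeding such integers to a regression is how they are spaced. The sketch below (names are illustrative) compares the gap between two adjacent days when each date is read as a plain YYYYMMDD integer with the gap in actual days.

```python
from datetime import date

def yyyymmdd_to_int(text):
    """Convert a 'YYYYMMDD' string to a plain integer, as in the example above."""
    return int(text)

last_of_january = yyyymmdd_to_int("20000131")
first_of_february = yyyymmdd_to_int("20000201")

# Difference between the two integers (they jump at the month boundary).
gap_as_int = first_of_february - last_of_january

# Difference in real calendar days, using ordinals from the standard library.
gap_in_days = date(2000, 2, 1).toordinal() - date(2000, 1, 31).toordinal()

print(gap_as_int, gap_in_days)
```

Within a single month the two measures agree, so whether this matters depends on the date range in the data.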

Tyler Christian
0

Maybe this is what you are looking for if I understood your question correctly :)

with open("YourFileName.txt", 'r') as f:
    for line in f:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # e.g. line = "cpu95-20000117-04004,134.perl,42.6,44.4"
        items = line.split(',')  # ['cpu95-20000117-04004', '134.perl', '42.6', '44.4']

        date = int(items[0].split('-')[1])  # 20000117
        lastItem = float(items[-1])         # 44.4
        # rest of your code
chowsai
0

Not the best answer, but you can try something like this:

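import csv

with open('file.txt', 'r', newline='') as file:
    dt = csv.reader(file, delimiter=',')
    for row in dt:
        if not row:
            continue  # skip blank lines
        date = int(row[0][6:14])  # characters 6-13 of 'cpu95-20000117-04004' -> 20000117
        value = float(row[3])     # last column, e.g. 44.4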
sgetachew