-3


I am new to Python and am learning it by using it for my data processing.

I have a large data file in the format shown below.
The number of rows and columns is not known in advance. This example shows 2 consecutive rows.
The 1st column is "time", and the nth column holds the relevant data, which is selected by an identifier in the header line ('abc' in the 1st line).

................
"2013-01-01 00:00:02" 228 227 15.65 15.84 14.85 14.68 14.53 13.75 12.45 12.55
"2013-01-02 00:01:03" 225 227 16.35 15.99 14.85 14.73 14.43 13.8 12.85 13.2
................

Desired output:

  1. Column 1 = time values, in a form that lets the time difference be calculated.
  2. Column n = the data to be processed further, as float.

In my past attempts I ended up with a list, and I was unable to convert either column.

I searched past questions and answers but, as a beginner, could not follow all of them. I would appreciate help reading the data into columns so that I can process it later. I believe I can handle the further processing myself, since it is mostly mathematical operations.

Thank you for your help.

Regards
Gouri

CORRECTION-1:
I found that pandas offers a compact way to extract the column I needed. This was a good lesson, following the suggestion from the group.
The code looks as follows:

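import pandas as pd

# fp is the path to the tab-separated data file
data = pd.read_csv(fp, sep='\t')
entry = data['u90']  # select one column by its header name
print(entry)
print(entry[5])

# write the selected column out; the file is closed automatically
with open("out.txt", "w") as out_file:
    entry.to_csv(out_file)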

Regards
Gouri

Gouri
  • this is a question that is being asked very often, use [pandas](http://pandas.pydata.org/pandas-docs/stable/io.html) to read your data for example – Deusdeorum May 05 '16 at 08:18
  • Being new to Python coding, pandas seems a bit complex to understand. I will keep it for further practice. This is also part of my assignment, so I am hoping to solve it soon with simple code. Above all, thank you for the suggestion. – Gouri May 05 '16 at 16:37
  • As per Hugo's suggestion I tried pandas, and it's quite efficient. – Gouri May 05 '16 at 18:20

3 Answers

1

If you would rather use a regular expression than pandas, the following code works for your dataset.

import re

#l1 = ["\"2013-01-01 00:00:02\" 228 227 15.65 15.84 14.85 14.68 14.53 13.75 12.45 12.55",
#"\"2013-01-02 00:01:03\" 225 227 16.35 15.99 14.85 14.73 14.43 13.8 12.85 13.2"]

l1 = """"2013-01-01 00:00:02\" 228 227 15.65 15.84 14.85 14.68 14.53 13.75 12.45 12.55
"2013-01-02 00:01:03\" 225 227 16.35 15.99 14.85 14.73 14.43 13.8 12.85 13.2"""

# Capture the timestamp (group 1) and the 4th field after it (group 2) from each row.
l_match = re.findall(r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\"\s\d+\s\d+\s\d+\.\d+\s(\d+\.\d+)', l1)

if l_match:
    for each_find in l_match:
        l_date = each_find[0]
        l_number = float(each_find[1])
        print(l_date)
        print(l_number)

Output

2013-01-01 00:00:02
15.84
2013-01-02 00:01:03
15.99
pmaniyan
  • After using pandas, I am able to get the other entry and the time as column arrays, e.g. '2013-01-01 00:00:00', '2013-01-01 00:00:01' and so on. I wish to convert these values to seconds after subtracting a base time ('2013-01-01 00:00:00', for example). This would give revised data of 0, 1, ... (in seconds). Any help, please? – Gouri May 05 '16 at 21:54
0

As pointed out by Hugo Honorem in a comment, you can use pandas.

If you do not want to introduce more dependencies to your project, you could use a function like this:

from operator import itemgetter

def load_dataset(fp, columns, types=None, delimiter=' ', skip_header=True):
    # Build a getter that picks the requested column indices from each row.
    get_columns = itemgetter(*columns)
    if skip_header:
        next(fp)  # discard the header line
    dataset = []
    for line in fp:
        parts = line.split(delimiter)
        selected = get_columns(parts)
        if types is not None:
            # Apply one converter per selected column, in order.
            selected = [convertor(col) for convertor, col in zip(types, selected)]
        dataset.append(selected)
    return dataset

columns should be a list of integers (the column indices to keep), and types is a list of callables that convert the selected columns into the types you want. For floats, just pass in float, and for your date you could pass a custom to_date function.
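As a rough usage sketch, the call might look like the following. The to_date helper, the sample data and its header names are assumptions made up for illustration, and the function is repeated (lightly adapted) only so that the snippet runs on its own.

```python
import io
from datetime import datetime
from operator import itemgetter


def load_dataset(fp, columns, types=None, delimiter=' ', skip_header=True):
    # Same logic as the function above, repeated so this sketch is self-contained.
    get_columns = itemgetter(*columns)
    if skip_header:
        next(fp)
    dataset = []
    for line in fp:
        parts = line.rstrip('\n').split(delimiter)
        selected = get_columns(parts)
        if types is not None:
            selected = [convert(col) for convert, col in zip(types, selected)]
        dataset.append(selected)
    return dataset


def to_date(text):
    # Hypothetical helper: drop the surrounding quotes, then parse the timestamp.
    return datetime.strptime(text.strip('"'), '%Y-%m-%d %H:%M:%S')


# Assumed sample: tab-separated rows with a header line (header names are invented).
sample = io.StringIO(
    'time\ta\tb\tc\td\n'
    '"2013-01-01 00:00:02"\t228\t227\t15.65\t15.84\n'
    '"2013-01-02 00:01:03"\t225\t227\t16.35\t15.99\n'
)

rows = load_dataset(sample, columns=[0, 4], types=[to_date, float], delimiter='\t')

# Time difference between the two rows, in seconds.
elapsed = (rows[1][0] - rows[0][0]).total_seconds()
```

One caveat: itemgetter returns a bare value rather than a tuple when given a single index, so this sketch assumes at least two columns are requested.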

tavo
0

What you have is a CSV file with whitespace as the separator, so you can use the csv library (https://docs.python.org/2/library/csv.html). Otherwise, you can read the file line by line and parse each line with split():

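N = 4  # index of the column that holds your value
with open('myfile.csv', 'r') as f:
    for line in f:
        # note: a quoted timestamp that contains a space is split into two parts here
        parts = line.rstrip('\n').split(' ')
        date = parts[0]
        value = parts[N]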

Where N is the column where your value is located (in your example, 4).
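For the csv library route, a minimal sketch could look like this. The in-memory sample and the VALUE_COLUMN index are assumptions for illustration; with a real file you would pass the opened file object to csv.reader instead.

```python
import csv
import io
from datetime import datetime

# Assumed sample: space-separated rows with a quoted timestamp, as in the question.
sample = io.StringIO(
    '"2013-01-01 00:00:02" 228 227 15.65 15.84\n'
    '"2013-01-02 00:01:03" 225 227 16.35 15.99\n'
)

VALUE_COLUMN = 4  # hypothetical index of the column of interest

times, values = [], []
# csv.reader keeps the double-quoted timestamp together as one field,
# so the space inside it is not treated as a separator.
for row in csv.reader(sample, delimiter=' '):
    times.append(datetime.strptime(row[0], '%Y-%m-%d %H:%M:%S'))
    values.append(float(row[VALUE_COLUMN]))
```

Compared with a plain split(' '), this keeps the date and the time of day in the same field, so the first column can be parsed directly.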

Nevertheless, I strongly recommend pandas; it will take your code quality to the next level.
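A small pandas sketch of the same task, including the time difference in seconds relative to the first row. The sample data and the column names 'time' and 'u80' are assumptions ('u90' is taken from the question's correction), and a tab separator is assumed as well.

```python
import io

import pandas as pd

# Assumed sample: tab-separated data with a header line.
sample = io.StringIO(
    'time\tu80\tu90\n'
    '2013-01-01 00:00:00\t15.65\t15.84\n'
    '2013-01-01 00:00:01\t16.35\t15.99\n'
    '2013-01-01 00:01:03\t16.40\t16.10\n'
)

data = pd.read_csv(sample, sep='\t')

# Convert the text timestamps into datetime values.
data['time'] = pd.to_datetime(data['time'])

# Seconds elapsed since the first timestamp, used here as the base time.
base = data['time'].iloc[0]
data['seconds'] = (data['time'] - base).dt.total_seconds()

# The data column of interest, as float.
values = data['u90'].astype(float)
```

Here the base time is simply the first row; any other reference timestamp could be passed through pd.to_datetime and subtracted in the same way.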

user1695639
  • This code works fine as a starting point, with one minor change: 'date = line.split('\t')[0]'. – Gouri May 05 '16 at 16:02