0

I am a newbie and trying to perform some calculations and analyse some data from a huge number of CSV files using Python 3. Each CSV file has data stored in about 30000 rows seperated by a semicolon. As these are system generated files, the first row has the names of Parameters in a different order for each file. So I want to extract these Parameter names, compare them and find out the positions of required parameter values for every file.

But before I even do that, when I read the first row of the CSV file, the list has a single string seperated by the ';' while the rest of the lines are properly extracted. I have tried splitting the text and other means but none of them have any effect on the output. Can someone kindly help solve this issue.

This is the piece of code I have written:

import os 
import sys
import csv
import math      
# Open a file
full_path = "C:\\Documents and Settings\\My Documents\\CSV files"
dirs = os.listdir( full_path )
filename = list()

"""This will find the file in the directory and print the first row of the 
CSV file - the names of the parameters."""
for file in dirs:
    path = full_path+'\\'+file
    with open(path, 'rt') as csvfile:
        #For printing the location name   
        Turbine_name = (os.path.splitext(file)[0])
        #Reading the files into an object
        csvreader = csv.reader(csvfile, delimiter=';')
        #Obtaining the first line of the CSV file - the names of the parameters
        Parameters = next(csvreader)
        print (Parameters)
        #From this line, the values of different Parameters are extracted
        for row in csvreader:
            Parameters = next(csvreader)
            print (Parameters)
            #To eliminate rows with insufficient data
            if len(Parameters)>11:            
                Created_time = Parameters[0]
                Grid_frequency = float(Parameters[3])
                Grid_KW = float(Parameters[4])
                Rotor_RPM = float(Parameters[10])

The positions of the parameters are taken for example here as I am yet to write the code for comparing strings. A sample of the input file looks like this:

createTime;"Grid CosPhi";"Grid Current";"Grid Frequency";"Grid kW";"Grid   VAr";"Grid Voltage";"Pitch angle 1";"Pitch angle 2";"Pitch angle 3";"Rotor RPM";"Temp.  5    214";"Temp.  6    217";"Temp.  9    227";"Winddirection";"Windspeed"
9/21/14 11:30:01 PM;N/A;N/A;49.963;211688.734;N/A;N/A;-1.06;-1.039;-1.119;19.379;47.167;36;64;N/A;6.319
9/21/14 11:40:01 PM;N/A;N/A;50.002;170096.297;N/A;N/A;-1.003;-0.96;-1.058;19.446;47.182;36.084;63.772;N/A;5.628
9/21/14 11:50:01 PM;N/A;N/A;50.021;175038.734;N/A;N/A;-0.976;-0.961;-1.082;18.805;47;36.223;63.153;N/A;5.577
9/22/14 12:00:01 AM;N/A;N/A;49.964;229942.016;N/A;N/A;-1.047;-1.018;-1.066;18.957;47.125;36.293;63.766;N/A;6.431
9/22/14 12:10:01 AM;N/A;N/A;49.908;200873.844;N/A;N/A;-0.997;-0.985;-1.06;19.229;47.028;36.334;63.962;N/A;6.076
9/22/14 12:20:01 AM;N/A;N/A;49.934;234467.609;N/A;N/A;-1.028;-0.986;-1.001;18.995;47.056;36.401;63.732;N/A;6.067
9/22/14 12:30:01 AM;N/A;N/A;49.96;270969.25;N/A;N/A;-1.138;-1.103;-1.122;18.983;47.274;36.499;64.014;N/A;6.487
9/23/14 12:30:01 PM;N/A;N/A;50.073;54458.719;N/A;N/A;-0.492;-0.52;-0.472;13;42.02;33.624;46.426;N/A;3.757

And when I get the output it looks like this:

>>>['createTime;"Grid CosPhi";"Grid Current";"Grid Frequency";"Grid kW";"Grid VAr";"Grid Voltage";"Pitch angle 1";"Pitch angle 2";"Pitch angle 3";"Rotor RPM";"Temp.  5    214";"Temp.  6    217";"Temp.  9    227";"Winddirection";"Windspeed"']
   ['9/21/14 11:40:01 PM', 'N/A', 'N/A', '50.002', '170096.297', 'N/A', 'N/A', '-1.003', '-0.96', '-1.058', '19.446', '47.182', '36.084', '63.772', 'N/A', '5.628']
   ['9/22/14 12:00:01 AM', 'N/A', 'N/A', '49.964', '229942.016', 'N/A', 'N/A', '-1.047', '-1.018', '-1.066', '18.957', '47.125', '36.293', '63.766', 'N/A', '6.431']
   ['9/22/14 12:20:01 AM', 'N/A', 'N/A', '49.934', '234467.609', 'N/A', 'N/A', '-1.028', '-0.986', '-1.001', '18.995', '47.056', '36.401', '63.732', 'N/A', '6.067']

As observed in some of the files, some of the Parameters are completely missing. This is why I need to find out positions of individual parameters in each CSV file. Any ideas how it can be done best are also appreciated. Thanks in advance.

Edit: Unfortunately I cannot use Pandas as it has to be seperately installed in Python34 and my office system (Windows XP, P4) does not support it. I want to do this with CSV module if possible.

Nikki_Champ
  • 1,047
  • 1
  • 9
  • 13

1 Answers1

1

I would suggest that you use pandas instead of csv as it is more capable of reading and creating spreadsheets. The first line of each file is usually used to name the columns. As each column has an unique name you do not need to know the position, since in pandas columns can be accessed by name.

import pandas as pd
data = pd.read_csv(yourFilePath, sep = ';', index_col = 0, parse_dates = True)

That would create a spreadsheet similar to MS Excel. The first column 'createTime' will be the index of your spreadsheet and pandas will try to parse the strings in this column to date/time information.

If you like to do some math now, you can do it column by column as pandas is able to interprete that correctly. For selection rows you can use selection by labels:

import pandas as pd
data1 = pd.read_csv(yourFilePath1, sep = ';', index_col = 0, parse_dates = True)
data2 = pd.read_csv(yourFilePath2, sep = ';', index_col = 0, parse_dates = True)
#select a specific time range only, e.g 22th Sept. 2014
range_of_interest = data.loc['20140922']
#select time range >= Sept. 22th, 2014
range_of_interest = data.loc['20140922':]

You can also align your dataframes according to the index. Here is a similar example

aligned_data = data1.join(data2, how = 'outer', rsuffic = '_2')

There are also multiple functions that allow you to calculate descriptions of the columns like mean value, standard deviation, etc. For example, the stastical values for the column 'Grid kW':

data['Grid kW'].describe()

Finally, you can do math with complete dataframes. pandas will automatically use the correct columns. If a speficic column is missing in one of the dataframes, the results of the calculation will be 'nan'. Same is valid for missing values. So the result of the calculation between the dataframes will always result in a dataframe with valid data only.

data1.loc['20140922'] + data2.loc['20140922']

That should cover your intentions.

Community
  • 1
  • 1
RaJa
  • 1,471
  • 13
  • 17
  • Hi @RaJa... Currently on my office computer, I don't have the module Pandas. I am downloading it now but is it possible to do the same operations using a text file in this case? Coz the system generates text files which I convert them to CSVs... Thanks again. – Nikki_Champ Jun 02 '15 at 06:30
  • Yes, of course. pandas reads multiple formats, even plain text files, as long as you have an unique separator for your columns. A blank space would be difficult in your case, since it is already part of your creation date. – RaJa Jun 02 '15 at 06:57
  • Moreover I don't want to mix up data from two different files... not even from two different rows. Is it possible to read data row-wise using Pandas coz I need to make calculations with the Parameters for a particular time-stamp... as in the same row data. So far I have not been able to find related documentation. – Nikki_Champ Jun 02 '15 at 11:48
  • I've edited my answer above to allow selection of specific time stamps. – RaJa Jun 02 '15 at 16:19
  • Thank you very much for your edits but I think I am not able to explain the problem clearly. To be very clear, I want to calculate individual slip values using only few of the parameters in each row of a single CSV file. As you have observed, a single time stamp includes both the date and time. But I dont want to select timestamps.. I want to calculate for all of the rows. Thats why I was asking for a loop or something to iterate over the rows like the next() does in the CSV module. Thanks – Nikki_Champ Jun 03 '15 at 05:12
  • For a simple iteration over your dataframe rows, you can use the iterate-function that comes with each dataframe-structure. see [this example](http://stackoverflow.com/a/16476974/4141279) – RaJa Jun 03 '15 at 07:23