0

I have a CSV file which looks like below. Right now it doesn't have any columns and it contains some fields I actually don't need so I need to write it into a new file based on some conditions.

!PROJECT1, OBJECT1
2020-09-10+02:00,100,HHH,SAS,RM$20,1,1
2020-09-16+02:00,200,GGG,SAS,TAKE,2020-09-16+02:00
2020-09-13+02:00,300,TTT,SAS,TAKE,2020-09-13+02:00
2020-09-11+02:00,100,HHH,SAS,RM$20,1,1

These are the conditions:

  1. I will only write the record if index[4] contains the word TAKE. If so, take index[0], [4] and [5].
  2. Index[0] and [5] need to be spitted and named in YEAR, MONTH and DAY and TD. index[4] needs to be named TYPE

I want my new file to look like this:

YEAR    MONTH    DAY    TD    TYPE    YEAR    MONTH    DAY    TD 
2020    09       16     2     TAKE     2020    09       16     2
2020    09       13     2     TAKE     2020    09       13     2

This is my code:

def filter_row(r):
    condition_1 = r[4] == 'TAKE' #<-- take only the TAKE's

with open(file_path, 'r') as my_file, open('outputfile.txt', 'w') as outer:
        reader = csv.reader(my_file, delimiter = ',')
        next(reader) #Skip the first row because it's just the header
        writer = csv.writer(outer, delimiter = '\t')
        for row in reader:            
                if filter_row(row):
                writer.writerow(row)

Right now my output file looks like this:

2020-09-16+02:00,  200,  GGG,   SAS,  TAKE,  2020-09-16+02:00
2020-09-13+02:00,  300,  TTT,   SAS,  TAKE,  2020-09-13+02:00
NorthAfrican
  • 135
  • 2
  • 10

2 Answers2

1

There are two tasks here. Firstly, filter the data, secondly parse the data to get the required values.

Filtering can be done with the built-in filter function (or a list comprehension or generator expression if you prefer). For the parsing a small function can be used as we need to process two dates per row.

operator.itemgetter is used to extract data from rows efficiently.

import csv
import operator

# Get the value of the 'take' column
type_getter = operator.itemgetter(4)
# Extract the columns we want to work with
columns_getter = operator.itemgetter(0, 4, 5)


def process_datestring(datestring):
    # Assumes offset is always positive; consider
    # using re.split if it could be positive or negative.
    date, _, offset = datestring.partition('+')
    dateparts = date.split('-')
    td, *_ = offset.partition(':')
    dateparts.append(td)
    return dateparts


headers = ['YEAR', 'MONTH', 'DAY', 'TD', 'TYPE', 'YEAR', 'MONTH', 'DAY', 'TD']
with open(file_path, 'r') as my_file, open('outputfile.txt', 'w') as outer:
    reader = csv.reader(my_file, delimiter = ',')
    next(reader) #Skip the first row because it's just the header
    writer = csv.writer(outer, delimiter = '\t')

    writer.writerow(headers)

    filtered = filter(lambda r: type_getter(r) == 'TAKE', reader)
    for row in filtered:
        date1, type_, date2 = columns_getter(row)
        out_row = process_datestring(date1)
        out_row.append(type_)
        out_row.extend(process_datestring(date2))
        writer.writerow(out_row)
snakecharmerb
  • 47,570
  • 11
  • 100
  • 153
0

You could write a parse function. Something like this:

def parser(row):
    if "TAKE" in row[4]: #Checks for your 'TAKE' statement
        year = row[0][0:4]
        month = row[0][5:7]
        ...
        year_2 = row[5][0:4]
        ...
        return [year, month, day, td, row[4], year_2, month_2,...]

And just give this whole thing to your writer:

for row in reader:
    converted = parser(row)
    writer.writerow(converted)

This should work. If you want to be more carefull, you will have to convert the string to a datetime.datetime object using strptime and then extract the dates, but if chopping the string is good enough, you could do this.

J. P.
  • 306
  • 1
  • 8