I am receiving data in a file named file1.dat, with fields separated by the | character.

109|LK98765|2|18.07.2021|01|abc1|01|abc2|01|abc3
110|LK67665|2|10.10.1987|02|abc1|01|abc2|01|abc3
111|LK43465|2|23.07.2005|03|abc1|01|abc2|01|abc3
112|LK23265|2|13.02.2012|04|abc1|01|abc2|01|abc3

My requirement is to add a header to the file and convert it to .csv with , as the field separator.

To achieve the above requirement, I wrote the Python code below.

To add the header:

def fn_add_header(file_name):
    with(open(file_name) as f:
        r=csv.reader(f)
        data = line [line for line in r]
    with(open(file_name,'wb') as f:
        w =csv.writer(f)
        w.writerow(['ID','SEC_NO','SEC_CD','SEC_DATE','SEC_ID1','SEC_DESC1','SEC_ID2','SEC_DESC2','SEC_ID3','SEC_DESC3'])
        w.writerows(data)

To convert the file to .csv:

def fn_replace(filename, directory):
    final_file = os.path.join(directory, "file1.csv")
    for file in os.listdir(directory):
        if fnmatch.fnmatch(file.lower(), filename.lower()):
            shutil.copyfile(file, final_file)
            cmd = ["sed", "-i", "-e", 's/|/,/g', final_file]
            ret2, out2, err2 = fn_run_cmd(cmd)
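For reference, the sed step can also be done in pure Python, which avoids the external process and the fn_run_cmd helper. This is only a sketch (the name fn_replace_py is hypothetical, not part of the original code), and it assumes the data contains no quoted fields with embedded | or , characters:

```python
# Pure-Python alternative to the sed call above (a sketch, not the
# original code): replace the '|' delimiter with ',' line by line.
def fn_replace_py(src_path, dst_path):
    with open(src_path, 'rt') as src, open(dst_path, 'wt') as dst:
        for line in src:
            # Safe only because this data has no quoted fields that
            # themselves contain '|' or ',' characters.
            dst.write(line.replace('|', ','))
```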

The above code works fine, and I get the converted file:

ID,SEC_NO,SEC_CD,SEC_DATE,SEC_ID1,SEC_DESC1,SEC_ID2,SEC_DESC2,SEC_ID3,SEC_DESC3
109,LK98765,2,18.07.2021,01,abc1,01,abc2,01,abc3
110,LK67665,2,10.10.1987,02,abc1,01,abc2,01,abc3
111,LK43465,2,23.07.2005,03,abc1,01,abc2,01,abc3
112,LK23265,2,13.02.2012,04,abc1,01,abc2,01,abc3

I am facing an issue while reading the converted file1.csv in a YAML configuration. To read the file I am using the code below:

frameworkComponents:
 today_file:
   inputDirectoryPath: <path of the file>
   componentName: today_file
   componentType: inputLoader
   hadoopfileFormat: csv
   csvSep: ','

selectstmt:
 componentName: selectstmt
 componentType: executeSparlSQL
 sql: |-
      select ID,SEC_NO,
             SEC_CD,SEC_DATE,
             SEC_ID1,SEC_DESC1,
             SEC_ID2,SEC_DESC2,
             SEC_ID3,SEC_DESC3
        from today_file

write_file:
   componentName: write_file
   componentType: outputWriter
   hadoopfileFormat: avro
   numberofPartition: 1
   outputDirectoryPath: <path of the file>

precedence:
 selectstmt:
     dependsOn:
        today_file: today_file

 write_file:
     dependsOn:
        selectstmt: selectstmt

When I run the YAML I get the error below:
Unable to infer schema for CSV. It must be specified manually.

  • When you create the `csv.reader`, you need to specify a `delimiter='|'` keyword argument, because the default is comma-separated fields. – martineau Jul 18 '21 at 06:40

1 Answer

Your function fn_add_header has several errors: unbalanced brackets in the with statements; the statement data = line [line for line in r] (you probably want a list comprehension, which would be data = [line for line in r]); and you open the output file in binary mode ('wb'), while csv reading and writing should use text mode.

You should rewrite the function fn_add_header as follows. Note that when reading the data file you must pass delimiter='|', and when writing the new CSV file you must pass newline='', otherwise an extra blank line is added after each row.

import csv

def fn_add_header(file_name):
    with open(file_name, 'rt') as f:
        r = csv.reader(f, delimiter='|')
        data = [line for line in r]

    new_file_name = file_name[:-4] + '.csv'
    with open(new_file_name, 'wt', newline='') as f:
        w = csv.writer(f)
        w.writerow(['ID','SEC_NO','SEC_CD','SEC_DATE','SEC_ID1','SEC_DESC1','SEC_ID2','SEC_DESC2','SEC_ID3','SEC_DESC3'])
        w.writerows(data)

fn_add_header('doodle_data.dat')

doodle_data.dat

109|LK98765|2|18.07.2021|01|abc1|01|abc2|01|abc3
110|LK67665|2|10.10.1987|02|abc1|01|abc2|01|abc3
111|LK43465|2|23.07.2005|03|abc1|01|abc2|01|abc3
112|LK23265|2|13.02.2012|04|abc1|01|abc2|01|abc3

doodle_data.csv

ID,SEC_NO,SEC_CD,SEC_DATE,SEC_ID1,SEC_DESC1,SEC_ID2,SEC_DESC2,SEC_ID3,SEC_DESC3
109,LK98765,2,18.07.2021,01,abc1,01,abc2,01,abc3
110,LK67665,2,10.10.1987,02,abc1,01,abc2,01,abc3
111,LK43465,2,23.07.2005,03,abc1,01,abc2,01,abc3
112,LK23265,2,13.02.2012,04,abc1,01,abc2,01,abc3

Finally, you should rename the function to something like convert_and_add_header, as that is what it actually does.

  • Thanks for your response. I tried the solution you provided and it worked fine for the file I prepared for testing. But when I tried the same code on the file we received from the source, I got an error like "UnicodeDecodeError: 'utf8' codec can't decode byte 0x81 in position 46". – user8592197 Jul 22 '21 at 07:31
  • The following: https://stackoverflow.com/questions/19699367/for-line-in-results-in-unicodedecodeerror-utf-8-codec-cant-decode-byte may give some solutions to that problem... – Bruno Vermeulen Jul 22 '21 at 11:59
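Following up on the UnicodeDecodeError discussed in the comments: a minimal sketch of opening the source file with an explicit encoding. The function name and the 'latin-1' default are assumptions, not part of the original answer; 'latin-1' accepts every possible byte, so it never raises UnicodeDecodeError, but the truly correct encoding (often cp1252 for files from Windows systems) should be confirmed with the data provider.

```python
import csv

def fn_add_header_enc(file_name, encoding='latin-1'):
    # Read the pipe-delimited source with an explicit encoding so bytes
    # that are invalid UTF-8 (e.g. 0x81) do not raise UnicodeDecodeError.
    with open(file_name, 'rt', encoding=encoding) as f:
        r = csv.reader(f, delimiter='|')
        data = [line for line in r]

    # Write the result as UTF-8 so downstream readers (e.g. Spark) see a
    # consistent encoding; newline='' avoids blank lines between rows.
    new_file_name = file_name[:-4] + '.csv'
    with open(new_file_name, 'wt', newline='', encoding='utf-8') as f:
        w = csv.writer(f)
        w.writerow(['ID','SEC_NO','SEC_CD','SEC_DATE','SEC_ID1','SEC_DESC1',
                    'SEC_ID2','SEC_DESC2','SEC_ID3','SEC_DESC3'])
        w.writerows(data)
```

Alternatively, open(..., errors='replace') substitutes undecodable bytes with U+FFFD instead of raising, at the cost of losing the original byte values.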