
This is my sample file:

#%cty_id1,#%ccy_id2,#%cty_src,#%cty_cd3,#%cty_nm4,#%cty_reg5,#%cty_natnl6,#%cty_bus7,#%cty_data8
690,ALL2,,AL,ALBALODMNIA,,,,
90,ALL2,,,AQ,AKNTARLDKCTICA,,,
161,IDR2,,AZ,AZLKFMERBALFKIJAN,,,,
252,LTL2,,BJ,BENLFMIN,,,,
206,CVE2,,BL,SAILFKNT BAFSDRTHLEMY,,,,
360,,,BW2,BOPSLFTSWLSOANA,,,,

The problem is that #%cty_cd3 is a standard column (NOT NULL) that must hold exactly 2 letters, but in SQL Server the record shifts into the wrong column (due to an extra comma in between). How do I validate the CSV file to make sure that the 2-character code ends up only in column 4?

There are around 10,000 records.

Set of rules defined:

Should have a standard set of delimiters for each row
 if not
   check the NOT NULL columns for null values
     if a null is found
       remove the delimiter at that position

(In the bad rows, the three commas `,,,` have not been replaced with two commas `,,`.)

UPDATE: Can this be done using a script?

UPDATE 2: I only need a function that operates on records like

90,ALL2,,,AQ,AKNTARLDKCTICA,,,

corrects them using a regex (or any other method), and puts them back into the source file.
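
For illustration, here is the kind of function I mean; a minimal standalone sketch (in Java, since Talend expressions are Java). It assumes fields never contain commas, and the file name sample.csv is just an example:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class CsvCommaFixer {

    // Drop the extra comma when the 4th field is empty and the row then
    // continues with a two-letter code: "90,ALL2,,,AQ,..." -> "90,ALL2,,AQ,..."
    static String fixLine(String line) {
        return line.replaceAll("^([^,]*,[^,]*,[^,]*,),(?=[A-Z]{2},)", "$1");
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get("sample.csv"); // hypothetical file name
        List<String> fixed = new ArrayList<>();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            fixed.add(fixLine(line));
        }
        Files.write(file, fixed, StandardCharsets.UTF_8); // write back in place
    }
}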

vhadalgi

6 Answers


Your best bet here may be to use the tSchemaComplianceCheck component in Talend.

[Screenshot: example job setup]

If you read the file in with a tFileInputDelimited component and then check it with the tSchemaComplianceCheck, where you set cty_cd to not nullable, then it will reject your Antarctica row simply for the null where you expect no nulls.

[Screenshot: tSchemaComplianceCheck to reject rows with a null 'cty_cd']

From here you can use a tMap and simply map each field to the one above it.

[Screenshot: tMap to sort out the right fielding]

You should be able to tweak this easily as necessary, potentially with further tSchemaComplianceCheck components down the reject lines and mappings to suit. This method is a lot more self-explanatory, and you don't have to deal with complicated regexes that need careful management whenever you want to accommodate a different variation of your file structure, with the benefit that you will always capture all of the well-formatted rows.

ydaetskcoR
  • I gave a +1; we'll check tomorrow once I get on my system. ;) – vhadalgi Feb 05 '14 at 17:16
  • It's working fine (tweaking the rows is fantastic!)... but how do I join these destinations? Like, the `Accept goes into Excel1` and `Right fielding goes into Excel2`; then I need to do a union (tUnite doesn't work?) and send it to a database! – vhadalgi Feb 06 '14 at 09:57
  • Frustratingly you can't recombine flows in Talend for no obvious good reason (might be something underlying). What you can do is output them to a flat file or a DB and then read the flat file or DB back in and join them. Link the sub jobs (or even separate jobs with a tRunJob) with an On Subjob OK link and it should work fine. If you really want, you could then have another subjob to delete the temporary split data. – ydaetskcoR Feb 06 '14 at 10:04
  • OK... but if it were regexes, how would it be? I don't think tweaking works in every case? (I'm currently handling 10k records, but if this solution is given to the client it might get messed up!) (An efficient solution would use regexes only.) But thanks very much @ydea – vhadalgi Feb 06 '14 at 10:18
  • I would be inclined to run through all of the possible combinations of the data (profile it first to see what exists) and branch them all down nested rejects from `tSchemaComplianceCheck` components. A regex probably won't cover all possible instances and if it does then it will most likely be so complicated you won't be able to go back and change it to include a new case when you find one later on. This way your client can clearly see what is going on and could maybe even add new cases themselves. Plus you never have to worry about losing well formatted rows. – ydaetskcoR Feb 06 '14 at 10:43
  • I should add that I'm a big fan of regexes, but a poorly defined pattern (having potential extra commas anywhere in the string, which shoves fields along) is not a good candidate for them. Instead consider focusing on your strongest condition: non-null fields being null when the row is badly formatted. – ydaetskcoR Feb 06 '14 at 10:45
  • I'm currently working on it! I'll let you know what comes out! (And yes, the best part is to separate NOT NULL fields having nulls, then tweak them!) – vhadalgi Feb 06 '14 at 10:49

You could try deleting the (empty) field in column 4 whenever column 4 is not a two-character field, as follows:

awk 'BEGIN {FS=OFS=","}        # comma in, comma out
{
    for (i=1; i<=NF; i++) {
        # print every field except a 4th field that is not a quoted
        # two-letter code (i.e. the spurious empty field)
        if (!(i==4 && length($4)!=4))
            printf "%s%s", $i, (i<NF) ? OFS : ORS
    }
}' file.csv

Output:

"id","cty_ccy_id","cty_src","cty_nm","cty_region","cty_natnl","cty_bus_load","cty_data_load"
6,"ALL",,"AL","ALBANIA",,,,
9,"ALL",,"AQ","ANTARCTICA",,,
16,"IDR",,"AZ","AZERBAIJAN",,,,
25,"LTL",,"BJ","BENIN",,,,
26,"CVE",,"BL","SAINT BARTH�LEMY",,,,
36,,,"BW","BOTSWANA",,,,
41,"BNS",,"CF","CENTRAL AFRICAN REPUBLIC",,,,
47,"CVE",,"CL","CHILE",,,,
50,"IDR",,"CO","COLOMBIA",,,,
61,"BNS",,"DK","DENMARK",,,,

Note:

  • We use length($4)!=4 since we expect two characters in column 4, plus two extra characters for the double quotes.
Håkon Hægland

The solution is to use a look-ahead regex, as suggested before. To reproduce your issue I used this:

"\\,\\,\\,(?=\\\"[A-Z]{2}\\\")"

which matches three commas followed by a quoted two-letter uppercase code, without including the code in the match. Of course you may need to adjust it a bit for your needs (i.e. an arbitrary number of commas rather than exactly three).
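
You can sanity-check the pattern in plain Java before wiring it into the job (the quoted sample row below is made up to mirror the Antarctica record):

String bad = "9,\"ALL\",,,\"AQ\",\"ANTARCTICA\",,,";
String ok = bad.replaceAll("\\,\\,\\,(?=\\\"[A-Z]{2}\\\")", ",,");
System.out.println(ok); // 9,"ALL",,"AQ","ANTARCTICA",,,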

But you cannot use it in Talend directly without tons of errors. Here's how to design your job:

[Screenshot: job design]

In other words, you need to read the file line by line, no fields yet. Then, inside the tMap, do the match&replace, like:

row1.line.replaceAll("\\,\\,\\,(?=\\\"[A-Z]{2}\\\")", ",,")

[Screenshot: tMap definition]

and finally tokenize the line using "," as the separator to get your final schema. You'll probably need to manually trim out the quotes here and there, since tExtractDelimitedFields won't do it for you.
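
For instance, a tMap expression along these lines would strip them (the connection and column names row2 and cty_nm are just placeholders):

row2.cty_nm == null ? null : row2.cty_nm.replace("\"", "")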

Here's an output example (it needs some cleaning, of course):

[Screenshot: output snippet]

You don't need to enter the schema for tExtractDelimitedFields by hand. Use the wizard to record a DelimitedFile schema into the metadata repository, as you probably already did. You can use this schema as a Generic Schema, too, fitting it to the outgoing connection of tExtractDelimitedFields. Not something the purists would approve of, but it works and saves time.

About your UI problems, they are often related to file encodings and locale settings. Don't worry too much, they (usually) won't affect the job execution.

EDIT: here's a sample TOS job which shows the solution; just import it into your project: [Attachment: TOS job archive]

EDIT2: added some screenshots

Gabriele B
  • I gave a +1; we'll check tomorrow once I get on my system. ;) – vhadalgi Feb 05 '14 at 17:18
  • I tried this, but there's no change in the output file, and replaceAll isn't working in the tMap component? – vhadalgi Feb 06 '14 at 06:44
  • After I put `row2.line.replaceAll("\\,\\,\\,(?=\\\"[A-Z]{2}\\\")", ",,")` and press `test` it shows null? And then in the output coming from the `tExtractDelimitedFields` component the Antarctica row `doesn't get shifted`? Can you please try to execute this and give me a detailed answer? I'm stuck! – vhadalgi Feb 06 '14 at 09:14
  • I executed it! :) you probably mess something with the connections names. I exported my example job and uploaded here together with some screenshots. Hope this helps! – Gabriele B Feb 06 '14 at 10:50
  • Previously I messed up... but there's a change here: there are no more " " (double quotes) in the .csv file – vhadalgi Feb 06 '14 at 11:49
  • Can you see the update? My bad! The current CSV has no double quotes! – vhadalgi Feb 06 '14 at 12:19
  • Thanks, it was your plan to use a tFullrow. "Grazie"! – vhadalgi Feb 06 '14 at 12:56
  • @gabi... man, do you mind checking this one more time? Does Talend have anything where I can put a VB script? – vhadalgi Feb 11 '14 at 08:43

Coming to the party late with a VBA-based approach. An alternative to regex is to parse the file and remove a comma whenever the 4th field is empty. Using the Microsoft Scripting Runtime, the code opens the file and reads each line, copying it to a new temporary file; if the 4th element is empty, it writes the line with the extra comma removed. The cleaned data is then copied back over the original file and the temporary file is deleted. It seems a bit of a long way round, but when I tested it on a file of 14,000 rows based on your sample it took under 2 seconds to complete.

Sub Remove4thFieldIfEmpty()
    ' Requires a reference to "Microsoft Scripting Runtime"

    Const iNUMBER_OF_FIELDS As Integer = 9

    Dim str As String
    Dim fileHandleInput As Scripting.TextStream
    Dim fileHandleCleaned As Scripting.TextStream
    Dim fsoObject As Scripting.FileSystemObject
    Dim sPath As String
    Dim sFilenameCleaned As String
    Dim sFilenameInput As String
    Dim vFields As Variant
    Dim iCounter As Integer
    Dim sNewString As String

    sFilenameInput = "Regex.CSV"
    sFilenameCleaned = "Cleaned.CSV"
    Set fsoObject = New FileSystemObject

    sPath = ThisWorkbook.Path & "\"


    Set fileHandleInput = fsoObject.OpenTextFile(sPath & sFilenameInput)

    If fsoObject.FileExists(sPath & sFilenameCleaned) Then
        Set fileHandleCleaned = fsoObject.OpenTextFile(sPath & sFilenameCleaned, ForWriting)
    Else
        Set fileHandleCleaned = fsoObject.CreateTextFile((sPath & sFilenameCleaned), True)
    End If


    ' Copy each line to the cleaned file, dropping the empty 4th field
    Do While Not fileHandleInput.AtEndOfStream
        str = fileHandleInput.ReadLine
        vFields = Split(str, ",")
        If vFields(3) = "" Then
            ' Rebuild the line, skipping element 3 (the spurious empty field)
            sNewString = vFields(0)
            For iCounter = 1 To UBound(vFields)
                If iCounter <> 3 Then sNewString = sNewString & "," & vFields(iCounter)
            Next iCounter
            str = sNewString
        End If
        fileHandleCleaned.WriteLine (str)
    Loop


    fileHandleInput.Close
    fileHandleCleaned.Close

    ' Copy the cleaned data back over the original file
    Set fileHandleInput = fsoObject.OpenTextFile(sPath & sFilenameInput, ForWriting)
    Set fileHandleCleaned = fsoObject.OpenTextFile(sPath & sFilenameCleaned)

    Do While Not fileHandleCleaned.AtEndOfStream
        fileHandleInput.WriteLine (fileHandleCleaned.ReadLine)
    Loop

    fileHandleInput.Close
    fileHandleCleaned.Close

    Set fileHandleCleaned = Nothing
    Set fileHandleInput = Nothing

    ' Delete the temporary file (VBA's built-in Kill statement)
    Kill sPath & sFilenameCleaned

    Set fsoObject = Nothing

End Sub
Graham Anderson
  • @VijaykumarHadalgi Oops, I left a bug in the code: the line that read `iCounter = 1 To UBound(vFields)-2` should have read `iCounter = 1 To UBound(vFields)`. I have edited the solution; it should work as expected now. Also, if you like, when I remove a comma I could put an extra comma at the end? – Graham Anderson Feb 15 '14 at 20:50
  • Hi @VijaykumarHadalgi The original file will have the changes as above. We could avoid using a temporary file and store the lines in memory before writing to the file; alternatively, we could open the CSV file in Excel, make the changes, and then save the file. I could modify the subroutine to take a filename as an argument so that any file could be processed. What issue are you having with the code? – Graham Anderson Feb 16 '14 at 23:24

If that's the only problem (and if you never have a comma in the field bt_cty_ccy_id), then you could remove such an extra comma by loading your file into an editor that supports regexes and having it replace

^([^,]*,[^,]*,[^,]*,),(?="[A-Z]{2}")

with \1.
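
If you'd rather script it than use an editor, the same replacement works in plain Java too; note the backreference is written $1 there (the quoted sample row is made up):

String row = "9,\"ALL\",,,\"AQ\",\"ANTARCTICA\",,,";
System.out.println(row.replaceAll("^([^,]*,[^,]*,[^,]*,),(?=\"[A-Z]{2}\")", "$1"));
// prints: 9,"ALL",,"AQ","ANTARCTICA",,,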

Tim Pietzcker
  • I tried this! I get an error in Talend: `Note: Preview errors are generally due to a wrong encoding setting. org.talend.designer.runprocess.shadow.ShadowFilePreview.preview(ShadowFilePreview.java:90) org.talend.repository.ui.utils.ShadowProcessHelper.getCsvArray(ShadowProcessHelper.java:383) org.talend.repository.ui.wizards.metadata.connection.files.regexp.RegexpFileStep2Form$PreviewProcessor.nonUIProcessInThread(RegexpFileStep2Form.java:518) org.talend.commons.ui.swt.thread.SWTUIThreadProcessor$1.run(SWTUIThreadProcessor.java:74)` I read the docs from Talend but no help! – vhadalgi Feb 05 '14 at 09:54
  • Well, which encoding setting have you used, and which encoding is used by your file? – Tim Pietzcker Feb 05 '14 at 10:08
  • It's `UTF-8`, sir, I double-checked! – vhadalgi Feb 05 '14 at 10:37
  • Not related to your issue, but about the exception: it's something related to the Talend UI. These errors are often related to some helper conveniences (i.e. wizards, metadata, repository...), not the Talend-generated job code. So don't waste time on this error, which is not the cause of your CSV skewness. About the problem, let me check your data first :) – Gabriele B Feb 05 '14 at 15:27
  • Can you help by telling me what the regex would be if there are no double quotes in the file? I mean the regex for the above `.csv file`. – vhadalgi Feb 06 '14 at 12:36

I would question the source system that is sending you this file as to why there's an extra comma in between for some rows. I guess you are using a comma as the delimiter when importing this .csv file into Talend.

(Another suggestion would be to ask for a semicolon as the column separator in the input file.)

9,"ALL",,,"AQ","ANTARCTICA",,,,

would become

9;"ALL";,;"AQ";"ANTARCTICA";;;;

garpitmzn