-1

I have a sample file which looks like

emp_id(int),name(string),age(int)
1,hasa,34
2,dafa,45
3,fasa,12
8f,123Rag,12
8,fafl,12

Requirement: Column data types are specified as strings and integers. Emp_id should be a integer not string. these conditions ll be the same for name and age columns.

My output should look like#

Actual column Emp_id type is INT but string was found at the position 4, value is 8f
Actual column name type is STRING but numbers were found at the position 4, value is 123Rag

continues..

here is my code Shell script

read input
if [ $input -eq $input 2>/dev/null ]
then
     echo "$input is an integer"
else
    echo "$input is not an integer"
fi

In python, i was trying with Isinstance(obj,type) but it is not serving the purpose. can any guide me in this regard, Any shell/python/perl script help would be appreciated!

Gokul Krishna
  • 105
  • 2
  • 5

2 Answers2

1

Here is an awk-solution:

awk -F"," 'NR==1{for(i=1; i <= NF; i++){
                        split($i,a,"(");
                        name[i]=a[1]; 
                        type[i] = ($i ~ "int" ? "INT" : "String")}next}
           {for(i=1; i <= NF; i++){
               if($i != int($i) && type[i] == "INT"){error[i][NR] = $i}
               if($i ~ /[0-9]+/ && type[i] == "String"){error[i][NR] = $i}
           }}
           END{for(i in error){
                       for(key in error[i]){
                            print "Actual column "name[i]" type is "type[i]\
                                  " but string was found at the position "key-1\
                                  ", value is "error[i][key]}}}' inputFile

The output is - as desired:

Actual column emp_id type is INT but string was found at the position 4, value is 8f
Actual column name type is String but string was found at the position 4, value is 123Rag

However, in my opinion 123Rag is a string and should not be indicated as an incorrect entry in the second column.

glenn jackman
  • 238,783
  • 38
  • 220
  • 352
F. Knorr
  • 3,045
  • 15
  • 22
  • Your INT test is wrong IMO: the value `1.1` would pass. This is better: `$i != int($i)`. Otherwise, pretty much my thoughts too. – glenn jackman Dec 02 '15 at 17:45
  • @glenn jackman: Yes, you're right, of course! The `$i == $i+0` tests whether the value is a number at all (int or double doesn't matter). I somehow forgot the `int`-restriction. – F. Knorr Dec 02 '15 at 20:27
0

With perl I would tackle it like this:

  • Define some regex patterns that match/don't match the string content.
  • pick out the header row - separate it into names and types. (Optionally reporting if a type doesn't match).
  • iterate your fields, matching by column, figuring out type and applying the regex to validate

Something like:

#!/usr/bin/env perl

use strict;
use warnings;
use Data::Dumper;

#define regex to apply for a given data type
my %pattern_for = (
    int    => qr/^\d+$/,
    string => qr/^[A-Z]+$/i,
);

print Dumper \%pattern_for;

#read the first line. 
# <> is a magic filehandle, that reads files specified as arguments 
# or piped input - like grep/sed do. 
my $header_row = <>;
#extract just the names, in order. 
my @headers = $header_row =~ m/(\w+)\(/g;
#create a type lookup for the named headers. 
my %type_for = $header_row =~ m|(\w+)\((\w+)\)|g;

print Dumper \@headers;
print Dumper \%type_for;

#iterate input again
while (<>) {
    #remove trailing linefeed
    chomp;

    #parse incoming data into named fields based on ordering. 
    my %fields;
    @fields{@headers} = split /,/;
    #print for diag
    print Dumper \%fields;

    #iterate the headers, applying the looked up 'type' regex
    foreach my $field_name (@headers) {
        if ( $fields{$field_name} =~ m/$pattern_for{$type_for{$field_name}}/ ) {
            print
                "$field_name => $fields{$field_name} is valid, $type_for{$field_name} matching $pattern_for{$type_for{$field_name}}\n";
        }
        else {
            print "$field_name $fields{$field_name} not valid $type_for{$field_name} matching $pattern_for{$type_for{$field_name}}\n";
        }
    }
}

This gives for your input (just the invalids for brevity):

name 123Rag not valid string matching (?^i:^[A-Z]+$)
emp_id 8f not valid int matching (?^:^\d+$)

Note - it only supports 'simple' CSV style (no nested commas or quotes) but could easily be adapted to use the Text::CSV module.

Sobrique
  • 52,974
  • 7
  • 60
  • 101