
First I will describe what I have, then the problem.

I have a text file that is structured like this:

----------- Start of file-----
<!-->
name,name2,ignore,name4,jojobjim,name3,name6,name9,pop
-->
<csv counter="1">
1,2,3,1,6,8,2,8,2,
2,6,5,1,5,8,7,7,9,
1,4,3,1,2,8,9,3,4,
4,1,6,1,5,6,5,2,9
</csv>
-------- END OF FILE-----------

I also have a perl program that has a map:

my %column_mapping = (
    "name"  => 'name',
    "name1" => 'name_1',
    "name2" => 'name_2',
    "name3" => 'name_3',
    "name4" => 'name_4',
    "name5" => 'name_5',
    "name6" => 'name_6',
    "name7" => 'name_7',
    "name9" => 'name_9',
);

My dynamic insert statement (assume I have connected to the database properly, and @headers is my array of header names, such as test1, test2, etc.):

my $sql = sprintf 'INSERT INTO tablename ( %s ) VALUES ( %s )',
    join( ',', map { $column_mapping{$_} } @headers ),
    join( ',', ('?') x scalar @headers ); 

my $sth = $dbh->prepare($sql);

Now for the problem I am actually having: I need a way to do the insert using only the headers, and the values, that are in the map. In the example data file, there are several names that are not in the map. Is there a way I can ignore them, and the numbers associated with them in the csv section?

Basically, I want to make a subset csv, turning it into:

name,name2,name4,name3,name6,name9,
 1,2,1,8,2,8,
 2,6,1,8,7,7,
 1,4,1,8,9,3,
 4,1,1,6,5,2,

so that my insert statement will only insert the columns in the map. The data files are always different: the columns are not in the same order, and an unknown number of them will be in the map.

Ideally I need an efficient way to do this, since this script will be going through thousands of files, and each file has millions of lines of csv with hundreds of columns.

It is just a text file being read, not a .csv file, so I'm not sure whether csv libraries can work in this scenario.
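To make the format concrete, here is roughly how I'm reading it now (a trimmed-down, self-contained sketch with the file contents inlined; the real script reads line by line the same way):

```perl
use strict;
use warnings;

# The raw file contents (trimmed); the headers live in the comment
# block and the data lines sit between the <csv> tags.
my $text = <<'END';
<!-->
name,name2,ignore,name4,jojobjim,name3,name6,name9,pop
-->
<csv counter="1">
1,2,3,1,6,8,2,8,2,
2,6,5,1,5,8,7,7,9,
</csv>
END

my ( @headers, @rows, $in_csv );
for my $line ( split /\n/, $text ) {
    if    ( $line =~ /^<csv/ )    { $in_csv = 1; next }
    elsif ( $line =~ m{^</csv>} ) { $in_csv = 0; next }

    if ( !$in_csv && !@headers && $line =~ /,/ ) {
        @headers = split /,/, $line;    # header row inside the comment
        next;
    }
    push @rows, [ split /,/, $line ] if $in_csv;
}
# @headers now holds the 9 names and @rows the raw values; the open
# question is how to drop the columns whose headers are not in the map.
```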

Lain
  • @ThisSuitIsBlackNot pretty sure that's just for the headers though; does it figure out which data to skip over in the csv section? I don't believe it does – Lain Aug 05 '15 at 19:57
  • Ah, I see what you're asking. You're right, it doesn't. – ThisSuitIsBlackNot Aug 05 '15 at 20:02
  • @ThisSuitIsBlackNot I realized after he updated it that that wasn't going to solve it. I think this is the last perl related question I will have. This script is fairly complicated to me since I didn't know perl as of yesterday lol. So I get stumped on a few spots unfortunately... – Lain Aug 05 '15 at 20:09
  • You were right to ask a new question, I apologize for not reading carefully enough. This is a fairly involved task to have to do after only one day of Perl! – ThisSuitIsBlackNot Aug 05 '15 at 20:13
  • You have a potential SQL injection attack in using `@headers` unescaped. They should, at minimum, be run through [`DBI->quote_identifier`](https://metacpan.org/pod/DBI#quote_identifier). Better would be to use [SQL::Abstract](https://metacpan.org/pod/SQL::Abstract) to construct your SQL statements or an ORM or like [DBIx::Class](https://metacpan.org/pod/DBIx::Class). – Schwern Aug 05 '15 at 20:17
  • Are you asking about how to efficiently parse that bizarre CSV embedded in XML format? Or only about what to do after that, once you've extracted the headers and columns? Or both? They're two different problems. Is `<!-->` a typo? – Schwern Aug 05 '15 at 20:20
  • @Schwern After I parsed it. I will have the headers in an array; I wasn't sure what to do with the csv part, whether I should store a line in a string and then do something to it to remove the values. The stuff in the data file was just to show it wasn't a plain .csv file. Also no, it's not a typo. – Lain Aug 05 '15 at 20:22
  • @ThisSuitIsBlackNot it really is lol. The biggest problem is that even when someone answers a question perfectly I can still fail to understand it :D (kinda like what happened with the answer to this question lol). – Lain Aug 05 '15 at 20:36
  • @Lain I would suggest cutting your question down to just the parsed data structure and remove all the XML and CSV distractions. – Schwern Aug 05 '15 at 21:07

2 Answers


You would typically put the set of valid indices in a list and use array slices after that.

my @valid = grep { defined( $column_mapping{ $headers[$_] } ) } 0 .. $#headers;

...

my $sql = sprintf 'INSERT INTO tablename ( %s ) VALUES ( %s )',
  join( ',', map { $column_mapping{$_} } @headers[@valid] ),
  join( ',', ('?') x scalar @valid);
my $sth = $dbh->prepare($sql);

...

chomp( my $line = <INPUT> );
my @row = split /,/, $line;
$sth->execute( @row[@valid] );

...
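Putting those pieces together, here's a self-contained sketch (the headers and a sample row from the question are hard-coded, and the database calls are left out):

```perl
use strict;
use warnings;

my %column_mapping = (
    "name"  => 'name',
    "name2" => 'name_2',
    "name3" => 'name_3',
    "name4" => 'name_4',
    "name6" => 'name_6',
    "name9" => 'name_9',
);

my @headers = qw(name name2 ignore name4 jojobjim name3 name6 name9 pop);

# Indices of the columns whose headers appear in the map
my @valid = grep { defined( $column_mapping{ $headers[$_] } ) } 0 .. $#headers;
# @valid is (0, 1, 3, 5, 6, 7)

my $sql = sprintf 'INSERT INTO tablename ( %s ) VALUES ( %s )',
    join( ',', map { $column_mapping{$_} } @headers[@valid] ),
    join( ',', ('?') x scalar @valid );
# INSERT INTO tablename ( name,name_2,name_4,name_3,name_6,name_9 ) VALUES ( ?,?,?,?,?,? )

# The same index slice picks the matching values out of each data line:
chomp( my $line = "1,2,3,1,6,8,2,8,2,\n" );
my @row    = split /,/, $line;
my @values = @row[@valid];    # (1, 2, 1, 8, 2, 8) -- pass these to $sth->execute
```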
mob
  • And speaking of slices, you could say `@column_mappings{@headers[@valid]}` instead of `map { $column_mappings{$_} } @headers[@valid]` – mob Aug 05 '15 at 20:30
  • Any chance you can edit and throw a comment on the last 2 lines? Sorry, just learning perl as of yesterday; kinda confused how those last 2 lines figure out which csv data to grab or skip over :S – Lain Aug 05 '15 at 20:34
  • It is reading a full row of data into the array `@row` (e.g., `"1,2,3,1,6,8,2,8,2,"` from the OP). It is passing the valid parts of `@row` (corresponding to columns with valid headers) to `$sql->execute()` – mob Aug 05 '15 at 20:36
  • Any idea why, when I try to use this code, I get "Can't locate object method execute via package "INSERT INTO (then the rest of my insert statement)"? Am I missing a prepare statement or something? – Lain Aug 19 '15 at 13:06
  • 1
    @Lain - Edited. Use the `$sth` statement object the same way you did in the OP – mob Aug 19 '15 at 13:15

Because this is about four different questions in one, I'm going to take a higher level approach to the broad set of problems and leave the programming details to you (or you can ask new questions about the details).

I would get the data format changed as quickly as possible. Mixing CSV columns into an XML file is bizarre and inefficient, as I'm sure you're aware. Use a CSV file for bulk data. Use an XML file for complicated metadata.

Having the headers be an XML comment is worse: now you're parsing comments, which are supposed to be ignored. If you must retain the mixed XML/CSV format, put the headers into a proper XML tag. Otherwise what's the point of using XML?

Since you're going to be parsing a large file, use an XML SAX parser. Unlike a more traditional DOM parser which must parse the whole document before doing anything, a SAX parser will process it as it reads the file. This will save a lot of memory. I leave SAX processing as an exercise, start with XML::SAX::Intro.
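For instance, a minimal handler might look like this (an untested scaffold, assuming XML::SAX is installed from CPAN; note that the comment-header trick in your format is not well-formed XML, so the real file may need preprocessing first):

```perl
package CsvHandler;
use base 'XML::SAX::Base';

# Collect character data only while inside the <csv> element.
sub start_element {
    my ( $self, $el ) = @_;
    $self->{in_csv} = 1 if $el->{LocalName} eq 'csv';
}

sub end_element {
    my ( $self, $el ) = @_;
    $self->{in_csv} = 0 if $el->{LocalName} eq 'csv';
}

sub characters {
    my ( $self, $data ) = @_;
    $self->{buffer} .= $data->{Data} if $self->{in_csv};
}

package main;
use XML::SAX::ParserFactory;

my $handler = CsvHandler->new;
my $parser  = XML::SAX::ParserFactory->parser( Handler => $handler );
$parser->parse_uri( $ARGV[0] );    # path to the data file
# $handler->{buffer} now holds the raw CSV text to hand to a CSV parser
```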

Within the SAX parser, extract the data from the <csv> element and run a CSV parser on it. Text::CSV_XS is a good choice. It is efficient and has solved all the problems of parsing CSV data you are likely to run into.
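A sketch of that step, assuming Text::CSV_XS is installed and the CSV text has already been extracted (here it is fed from an in-memory filehandle, with a trimmed-down header list):

```perl
use strict;
use warnings;
use Text::CSV_XS;

my %column_mapping = ( name => 'name', name2 => 'name_2', name4 => 'name_4' );
my @headers = qw(name name2 ignore name4);
my @keep    = grep { exists $column_mapping{$_} } @headers;

my $csv_text = "1,2,3,1\n2,6,5,1\n";
open my $fh, '<', \$csv_text or die $!;

my $csv = Text::CSV_XS->new( { binary => 1, auto_diag => 1 } );
$csv->column_names(@headers);

my @all;
while ( my $row = $csv->getline_hr($fh) ) {
    push @all, [ @{$row}{@keep} ];    # hash slice: only the mapped columns
}
# $all[0] is [1, 2, 1] -- the 'ignore' column is gone
```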

When you finally have it down to a Text::CSV_XS object, call getline_hr in a loop to get the rows as hashes, apply your mapping, and insert into your database. @mob's solution is fine, but I would go with SQL::Abstract to generate the SQL rather than doing it by hand. This will protect against both SQL injection attacks and more mundane things like headers containing SQL metacharacters and reserved words.
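A sketch of the SQL::Abstract route (assuming the module is installed; tablename and the mapped column names are from the question, the values are made up for one row):

```perl
use strict;
use warnings;
use SQL::Abstract;

my $sqla = SQL::Abstract->new;

# db_column => value pairs for one row, built from the mapped headers only
my %record = ( name => 1, name_2 => 2, name_4 => 1 );

my ( $stmt, @bind ) = $sqla->insert( 'tablename', \%record );
# $stmt is something like:
#   INSERT INTO tablename ( name, name_2, name_4) VALUES ( ?, ?, ? )
# and @bind holds the values, so: $dbh->prepare($stmt); $sth->execute(@bind);
```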

It's important to separate the processing of the parsed data from the parsing of the data. I'm quite sure that hideous data format will change, either for the worse or the better, and you don't want to tie the code to it.

Schwern
  • If I could get how those files are given changed I would in a heartbeat. The only other programmer at our company wrote the files to be stored like that. There are many, many more problems with his files than the stuff I have posted. I cannot put the headers into a tag; I just get given the files as-is, unfortunately. I wrote my own mini parser to figure out which comment block is the header one; it wasn't fun. The data format won't be changing, it's been like that for 4 years ;( – Lain Aug 05 '15 at 21:01
  • @Lain If it's a technical problem of changing an old format, could you produce two copies of the data file? One a legacy version for existing parsers and one a new, sane format? Then you can incrementally change your parsers over to the more efficient format. Or is this a social problem where the other programmer produces crap and won't change? – Schwern Aug 05 '15 at 21:10
  • It's that the programmer produces it and won't change. He has a lot more problems with his code than having to parse comments. He's the professional programmer too, and I'm the 2nd year uni summer student :D Have so much dumb code to deal with. Thanks for the detailed answer though; I will be trying mob's answer today to see if I can get it working like that. – Lain Aug 06 '15 at 14:45