
I have thousands of .csv files that I'd like to import into a sqlite3 database. Each .csv file has labels on the first row.

My idea is to run a for loop over the file names and import them using the .import command:

    sqlite3 options.db ".mode csv data" ".import $f data"

The problem is that this will import the first row that contains the labels.

Question: How do I .import the .csv files without importing the first row of each file?

Thanks for helping! :D

Guilherme Salomé
  • I generally just import it all and then delete the header rows. You could also use `grep` and `cat`, and maybe some `sed`/`awk`, to clean the files up first. – Kevin Aug 18 '17 at 04:08
  • @Kevin The thing is that the headers are all text, while some of the data types that I'm using are integers or reals. Would .import still work properly? – Guilherme Salomé Aug 18 '17 at 04:09
  • @Kevin Any ideas on how to "pipe" (I'm a noob, sorry) `sed 1d $f` into the `.import`? – Guilherme Salomé Aug 18 '17 at 04:11
  • Are you against just quickly scripting it in Python? I can see that being reasonably easy as existing libraries allow you to ignore the first row – Jonathan Holland Aug 18 '17 at 04:12
  • @JonathanHolland Scripting in bash =/ – Guilherme Salomé Aug 18 '17 at 04:13
  • sqlite will work fine importing the text into an integer type column. It is dynamically typed on each `value`, not each `column`. – Kevin Aug 18 '17 at 04:15
  • I understand that from the tagging, just pointing out there are other options if the goal isn't to deliberately achieve the goal only in bash :) – Jonathan Holland Aug 18 '17 at 04:16
  • @JonathanHolland Indeed, that is a great option :D It's just that I don't remember anything of Python, and I haven't used it in a couple of years. But I should probably relearn the basics at least for my future sanity. – Guilherme Salomé Aug 18 '17 at 04:18
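Regarding the piping question in the comments: `.import` takes a file name rather than a stream, so from bash the usual trick is to hand it `/dev/stdin` fed by `sed 1d` (this works on Linux and macOS); sqlite3 shells from 3.32.0 onwards can also skip leading rows themselves. A rough sketch, assuming the `options.db`/`data` names from the question:

    # Drop each file's header line before handing it to .import.
    for f in *.csv; do
        sed 1d "$f" | sqlite3 options.db ".mode csv" ".import /dev/stdin data"
    done

    # On sqlite3 3.32.0 or newer, .import can skip the header itself:
    #   sqlite3 options.db ".import --csv --skip 1 $f data"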

3 Answers


You could install the SpatiaLite extension for SQLite and create a virtual table from each CSV file. It will automatically use the first row as the column names, and you can then create a new table, or import into an existing one, by selecting all the records from the virtual table. Obviously you'd want to script this, since you have thousands of CSV files to import.
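A rough sketch of how that could look for one file, assuming SpatiaLite's `VirtualText` module is available; the `VirtualText` argument list below is illustrative, so check your SpatiaLite version's documentation for the exact signature:

    # Expose the CSV as a virtual table, copy its rows, then drop it.
    sqlite3 options.db \
      ".load mod_spatialite" \
      "CREATE VIRTUAL TABLE vcsv USING VirtualText('file1.csv', 'UTF-8', 1);" \
      "CREATE TABLE IF NOT EXISTS data AS SELECT * FROM vcsv;" \
      "DROP TABLE vcsv;"

For the second and later files you'd use `INSERT INTO data SELECT * FROM vcsv;` instead of the `CREATE TABLE ... AS SELECT`.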

Alternatively, you could write a script that uses the Unix `cat` command but skips the first row of each file, if you're importing all the files into the same table.
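A minimal sketch of that second option, with `tail -n +2` playing the role of a `cat` that skips the first row, and assuming all files share the same layout:

    # Concatenate every file minus its header into one staging file,
    # then import the whole thing with a single .import.
    for f in *.csv; do tail -n +2 "$f"; done > combined.csv
    sqlite3 options.db ".mode csv" ".import combined.csv data"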

Or write a script that parses the first row to define the columns in a `CREATE TABLE` command and then iterates through the rest of the rows to populate the new table.
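A sketch of that third option, assuming plain headers (bare column names that need no quoting) and the `options.db`/`data` names from the question:

    # Create the table from the first file's header row, then load the bodies.
    set -- *.csv                              # $1 is now the first csv file
    header=$(head -n 1 "$1")                  # e.g. "col1,col2,col3"
    sqlite3 options.db "CREATE TABLE IF NOT EXISTS data ($header);"
    for f in *.csv; do
        sed 1d "$f" | sqlite3 options.db ".mode csv" ".import /dev/stdin data"
    done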

Do any of those options meet your needs?

Momus
  • All good suggestions. However if all the files are going in one table, they all need to have the same number of columns. Sqlite3 will choke on mismatched imports. – Kevin Aug 18 '17 at 04:27
  • I'm importing all files into the same table. It seems right now that the easiest solution might be to just import everything and afterwards remove the header rows. – Guilherme Salomé Aug 18 '17 at 04:32
  • If you're importing them all into the same table then use option 2 - write a script that 'cat's the csv files one after the other, skipping the first row, and then importing them. @Kevin is right though, the import will choke and die if you have different numbers of columns in the files. – Momus Aug 18 '17 at 04:52
  • @Kevin hmmm that is kind of a problem. Some of the files have 18 columns, while other have 23. Sqlite3 does seem to import them, but gives me a warning "expected 23 but found 18 - filling the rest with NULL". – Guilherme Salomé Aug 18 '17 at 04:58
  • I do need to do something about some files having 18 while others have 23 columns, otherwise sqlite3 just prints a bunch of warnings and it takes forever to go over one file. Any ideas? – Guilherme Salomé Aug 18 '17 at 05:04
  • Are they always the same 18 columns and the same 23 columns? You could write the import script to check the number of columns and use a different insert command depending on the column count. If they're not always the same columns then you're going to need to write something a bit more sophisticated. – Momus Aug 18 '17 at 05:23
  • Yes, some rows have 18 columns, others have 23 columns, but they are always the same. – Guilherme Salomé Aug 18 '17 at 05:26
  • Some rows in the same file have 18 while others have 23, or some files have 18 columns per row while other files have 23 columns per row? – Momus Aug 18 '17 at 05:32
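Picking up that last exchange: if whole files are either 18 or 23 columns wide, the import script could branch on the width of each file's first data row. A sketch, with `data18`/`data23` as hypothetical per-width tables:

    for f in *.csv; do
        # Count the fields on the first data row (line 2, after the header).
        ncols=$(sed -n 2p "$f" | awk -F, '{ print NF }')
        case $ncols in
            18) table=data18 ;;   # hypothetical table name
            23) table=data23 ;;   # hypothetical table name
            *)  echo "skipping $f: $ncols columns" >&2; continue ;;
        esac
        sed 1d "$f" | sqlite3 options.db ".mode csv" ".import /dev/stdin $table"
    done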

You can use GNU awk (I don't have access to Mac awk so I can't test) to skip the first row and add the missing columns:

$ cat test
1,1
2,2
$ awk '
BEGIN { FS=OFS="," }        # set separators
FNR==1 { next }             # skip the first record(s)
NF==2 { NF=4 }              # if field count is 2 set it to 4
1                           # output
' test # > newpath/newfile  # or glob several files, e.g. *.csv
2,2,,

Replace the NF values to suit your needs (`NF==18 { NF=23 }`), if I understood the comments to the other answer correctly. If the `NF=4` part doesn't work in Mac awk, you can replace it with `print $0 OFS OFS OFS OFS OFS; next` (one `OFS` per missing column, so five for the 18-to-23 case). Modern GNU awk also supports in-place editing.
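A sketch of that in-place variant, assuming GNU awk 4.1 or newer (it rewrites the files, so try it on copies first):

    # Delete each header and pad 18-column rows out to 23, editing the files in place.
    gawk -i inplace 'BEGIN { FS=OFS="," } FNR==1 { next } NF==18 { NF=23 } 1' *.csv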

James Brown

Assume file1.csv looks like this:

File1,Line1
File1,Line2
File1,Line3
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
1,2,3,4,5,6

And file2.csv looks like this:

File2,Line1
File2,Line2
File2,Line3
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
1,2,3,4,5,6

Then you can make awk print all your CSV files whilst skipping the first line of each (where FNR, i.e. the per-file line number, is 1) like this:

awk -F, 'FNR==1{next} 1' *csv

Output

File1,Line2
File1,Line3
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
1,2,3,4,5,6
File2,Line2
File2,Line3
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
1,2,3,4,5,6

Then you can check the number of fields present on each line, and if it is not 23, add as many commas as necessary to make it 23:

awk -F, 'FNR==1{next} NF!=23{$0=$0 substr(",,,,,,,,,,,,,,,,,,,,,,",1,23-NF)}1' *csv

Output

File1,Line2,,,,,,,,,,,,,,,,,,,,,
File1,Line3,,,,,,,,,,,,,,,,,,,,,
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,,,,,
1,2,3,4,5,6,,,,,,,,,,,,,,,,,
File2,Line2,,,,,,,,,,,,,,,,,,,,,
File2,Line3,,,,,,,,,,,,,,,,,,,,,
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,,,,,
1,2,3,4,5,6,,,,,,,,,,,,,,,,,

So, as an actual complete solution, you could do this:

awk -F, 'FNR==1{next} NF!=23{$0=$0 substr(",,,,,,,,,,,,,,,,,,,,,,",1,23-NF)}1' *csv > NewBigFile.csv

and then import NewBigFile.csv into sqlite.
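The import itself could then be as simple as this, reusing the `options.db`/`data` names from the question:

    sqlite3 options.db ".mode csv" ".import NewBigFile.csv data"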

Mark Setchell
  • I ended up importing everything, suppressing the errors (not a good thing), and deleting the rows via SQL afterwards. But this looks like a better solution to my problem, and one that doesn't require me to ignore errors. Thanks! – Guilherme Salomé Aug 25 '17 at 17:17