1

i have this huge csv file, it's 4GB, don't know how many rows but 320 columns.

since it can't be open in any program (except using 3rd party programs to split the file into multiple pieces) i'm trying to fins a way to extract the data i need. i only need about 10-15 columns from it.

i saw many solutions on the net (most in vbs) but i couldn't get any of them to work. i'd get errors and i don't know vbs to be able to troubleshoot them.

can anyone help please?

thank you

PS here's one example of the vbs code i found and tried using that i had no luck with.

the original error was "800a01f4 variable is undefined", on the net it was suggested to take out OPTION EXPLICIT. once i do that the next error is "800a01fa class not defined".

in both cases the line giving the error is "Set adoJetCommand = New ADODB.Command"

Option Explicit



Dim adoCSVConnection, adoCSVRecordSet, strPathToTextfile
Dim strCSVFile, adoJetConnection,adoJetCommand, strDBPath


Const adCmdText = &H0001

' Specify path to CSV file.
strPathToTextFile = "C:\Users\natalie.rynda\Documents\Temp\RemailMatch\"

' Specify CSV file name.
strCSVFile = "NPIOld.csv"

' Specify Access database file.
strDBPath = "C:\Users\natalie.rynda\Documents\Temp\RemailMatch\NPIs.mdb"

' Open connection to the CSV file.
Set adoCSVConnection = CreateObject("ADODB.Connection")
Set adoCSVRecordSet = CreateObject("ADODB.Recordset")

' Open CSV file with header line.
adoCSVConnection.Open "Provider=Microsoft.Jet.OLEDB.4.0;" & _
    "Data Source=" & strPathtoTextFile & ";" & _
    "Extended Properties=""text;HDR=YES;FMT=Delimited"""

adoCSVRecordset.Open "SELECT * FROM " & strCSVFile, adoCSVConnection

' Open connection to MS Access database.
Set adoJetConnection = CreateObject("ADODB.Connection")
adoJetConnection.ConnectionString = "DRIVER=Microsoft Access Driver (*.mdb);" _
    & "FIL=MS Access;DriverId=25;DBQ=" & strDBPath & ";"
adoJetConnection.Open

' ADO command object to insert rows into Access database.
Set adoJetCommand = New ADODB.Command


Set adoJetCommand.ActiveConnection = adoJetConnection
adoJetCommand.CommandType = adCmdText

' Read the CSV file.
Do Until adoCSVRecordset.EOF
    ' Insert a row into the Access database.
    adoJetCommand.CommandText = "INSERT INTO NPIs " _
        & "(NPI, EntityTypeCode, ReplacementNPI, EIN, MAddress1, MAddress2, MCity, MState, MZIP, SAddress1, SAddress2, SCity, SState, SZIP, ProviderEnumerationDate, LastUpdateDate, NPIDeactivationReasonCode, NPIDeactivationDate, NPIReactivationDate) " _
        & "VALUES (" _
            & "'" & adoCSVRecordset.Fields("NPI").Value & "', " _
            & "'" & adoCSVRecordset.Fields("Entity Type Code").Value & "', " _
            & "'" & adoCSVRecordset.Fields("Replacement NPI").Value & "', " _
            & "'" & adoCSVRecordset.Fields("Employer Identification Number (EIN)").Value & "', " _
            & "'" & adoCSVRecordset.Fields("Provider First Line Business Mailing Address").Value & "', " _
            & "'" & adoCSVRecordset.Fields("Provider Second Line Business Mailing Address").Value & "', " _
            & "'" & adoCSVRecordset.Fields("Provider Business Mailing Address City Name").Value & "', " _
            & "'" & adoCSVRecordset.Fields("Provider Business Mailing Address State Name").Value & "', " _
            & "'" & adoCSVRecordset.Fields("Provider Business Mailing Address Postal Code").Value & "', " _
            & "'" & adoCSVRecordset.Fields("Provider First Line Business Practice Location Address").Value & "', " _
            & "'" & adoCSVRecordset.Fields("Provider Second Line Business Practice Location Address").Value & "', " _
            & "'" & adoCSVRecordset.Fields("Provider Business Practice Location Address City Name").Value & "', " _
            & "'" & adoCSVRecordset.Fields("Provider Business Practice Location Address State Name").Value & "', " _
            & "'" & adoCSVRecordset.Fields("Provider Business Practice Location Address Postal Code").Value & "', " _
            & "'" & adoCSVRecordset.Fields("Provider Enumeration Date").Value & "', " _
            & "'" & adoCSVRecordset.Fields("Last Update Date").Value & "', " _
            & "'" & adoCSVRecordset.Fields("NPI Deactivation Reason Code").Value & "', " _
            & "'" & adoCSVRecordset.Fields("NPI Deactivation Date").Value & "', " _
            & "'" & adoCSVRecordset.Fields("NPI Reactivation Date").Value & "')"
    adoJetCommand.Execute
    adoCSVRecordset.MoveNext
Loop



' Clean up.
adoCSVRecordset.Close
adoCSVConnection.Close
adoJetConnection.Close
HansUp
  • 95,961
  • 11
  • 77
  • 135
lalachka
  • 403
  • 5
  • 16
  • 36
  • i want to add that i saw this posting http://stackoverflow.com/questions/427488/want-vba-in-excel-to-read-very-large-csv-and-create-output-file-of-a-small-subse?rq=1 and tried out the vbs option (error "no value given for one of the required parameters" and i didn't understand the vba solution. this is to say that i didn't just post without first spending a few hours searching and trying everything i can. thank you!! – lalachka Jul 27 '12 at 00:37
  • thank you, i will fix but i'm afraid my errors throws out before i even get to that point – lalachka Jul 27 '12 at 01:25
  • i just checked and i don't see where the fields don't match. – lalachka Jul 27 '12 at 01:29
  • Try Dim adoJetCommand As New ADODB.Command up in your DIM statements and see if it gets you farther in the step through of your code. Also check you have a reference to the Microsoft Activex Data Objects Library – datatoo Jul 27 '12 at 02:51
  • good reference http://msdn.microsoft.com/en-us/library/ms675101(VS.85).aspx – datatoo Jul 27 '12 at 02:59
  • thank you so much for the reference, checking it out – lalachka Jul 27 '12 at 19:07
  • well I actually believe this may be a references issue. did you check that the Library is available? That would explain the the undefined parameter – datatoo Jul 27 '12 at 19:10
  • no, i didn't. i don't know anything about libraries or classes. how do i check? just send me somewhere where i can learn, i don't expect you guys to spoon feed this to me – lalachka Jul 27 '12 at 19:13
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/14537/discussion-between-lalachka-and-datatoo) – lalachka Jul 27 '12 at 19:18

2 Answers2

1

If your CSV file is straightforward, without newlines or commas in unexpected places, then the standard *nix tool awk would be useful. It would allow you to easily extract the 15 columns you are looking for to a new CSV file. This blog post gives an explanation how to use it on CSV files.

Suppose that you want to extract columns 1, 3 and 7 from file.csv, then you could do this with the command

awk -F, '{print $1","$3","$7;}' file.csv

Your Windows machine probably does not have awk installed. There are a few options:

  • You can find it in MSYS, which basically provides you with a Unix-like shell environment in Windows. To me, this seems to be the easies way to go.

  • Another option seems to be Gawk for Windows, but I have no experience with that, so no guarantees.

  • You could try to achieve the same result using the Windows PowerShell, as explained in this blog post -- if you have that available. Again, I have no experience trying that.

  • Last but not least, you could switch to Linux, for example in a virtual machine. awk is usually available in *nix environments.

If you are parsing a more awkward CSV file, then check out parse csv file using gawk for a bunch of suggestions.

Community
  • 1
  • 1
Reinier Torenbeek
  • 16,669
  • 7
  • 46
  • 69
  • thank you so much!! but for education purposes, how is this done with vba or vbs? (if you know)))) – lalachka Jul 27 '12 at 01:29
  • Sorry, have no experience with that. Does [this post](http://stackoverflow.com/q/736629/1380680) help at all? It is about VB.net though... – Reinier Torenbeek Jul 27 '12 at 02:57
  • @lalachka why did you accept this answer if you wanted `vba` or `vbs`? FWIW I like the approach, I've addded `awk` to the to do list – brettdj Jul 27 '12 at 03:52
  • because i'm new here and i don't understand the rules 100%. i thought if the answer is correct - then it's an answer. but it's true, i wanted vba or vbs. i just didn't want to be rude to Reinier)))) – lalachka Jul 27 '12 at 19:03
  • @lalachka No worries about being rude or breaking rules -- you will be corrected if you are not behaving properly anyway ;-). It is important to pose your questions accurately though because people are taking an effort to answer it. Your question does not mention you need VBA or VBS, and it does refer to third party programs. – Reinier Torenbeek Jul 27 '12 at 21:58
  • you're right and i apologize. is there anything i can do to correct it now? – lalachka Jul 27 '12 at 22:13
  • @lalachka It is no big deal. If you feel this was not the right answer after all, you could un-accept it. That might trigger people to look at it again though, so that mostly makes sense if you want to wait for any better answers. It would be polite to upvote the originally accepted answer, because it does look like the answer was useful. You could also add your final solution as an answer. – Reinier Torenbeek Jul 27 '12 at 22:37
  • thank you for explaining)))) i'd like to have that code troubleshooted, if possible. so you're saying that if i unaccept it and upvote it - that's not being rude? also, can i make changes to the original post to clarify things? – lalachka Jul 27 '12 at 22:45
  • 1
    @lalachka I would not perceive that as rude, since you explained your motivations. I can not speak for other people of course. However, in this case asking a new question sounds more appropriate to me. You are shifting from "how to handle a big CSV file" to "what is wrong with my code". You could cut the piece below your "Thank you" and use that as the new question. It is really up to what you want to do in the end. The users of this site together make up the unwritten rules of how to behave, and you are one of them. – Reinier Torenbeek Jul 28 '12 at 03:37
0

In the VBE editor

enter image description here

Then find in the List the Microsoft Activex Data Objects Library. Not sure which version might be appropriate, but probably 6

enter image description here

It seems like your code doesn't know what the ADODB.COMMAND is and this should resolve that. I only know I was able to copy your code, and was able to step through it successfully, when the reference was set. Hope this helps explain

datatoo
  • 2,019
  • 2
  • 21
  • 28
  • ohh, i'm running this in vbs. let me try vba. thank you. that's why i was so lost, i didn't understand how you can set references in vbs. – lalachka Jul 27 '12 at 20:01
  • i got that library, in vba the code goes past that point and gives an error "item cannot be found in the collection corresponding to the requested name or ordinal". in the debugger the chunk starting with "adoJetCommand.CommandText = "INSERT INTO NPIs " is highlighted – lalachka Jul 27 '12 at 21:03
  • it seems like the csv file is not being read properly. i already got what i needed done last night by splitting the csv file so this is strictly for education but i'd still love to find out what the problem is. thank you for your time. – lalachka Jul 27 '12 at 21:14
  • so has this moved you forward at all? or were you getting an error at the same point previously? – datatoo Jul 27 '12 at 22:12
  • yeah))) i passed the original error and got stuck on the "adoJetCommand.CommandText = "INSERT INTO NPIs " line. my guess is that the csv file is not being read. i tried putting "MsgBox adoCSVRecordSet.Fields("Provider Enumeration Date").Value" before that line (just to see if it will show me the values) and got the same error. so it seems like it's not reading the file – lalachka Jul 27 '12 at 22:21
  • I think to debug this, I would have to do a simplified, smaller file with a couple fields to make sure the path, and syntax work, then progressively evolve it to the huge file. – datatoo Jul 28 '12 at 15:40
  • @lalachka I created a sample db, and csv file to test this, with my own path and am able to successfully have it work. So if you are sure your path is correct, can you try a small excerpt of you csv? a smaller version. If you can get that to work, as I have, then it may be something about the 4gb size is the problem – datatoo Jul 28 '12 at 17:21
  • @lalachka I hadn't thought to check this previously, and should have. Access will only support 2gb files, so this won't work if that is where you are attempting to import to – datatoo Jul 29 '12 at 21:08
  • the end result (selected columns) will be much less than 2GB, that's why i'm even going through this trouble. otherwise i'd just open it from access and carve out what i need. – lalachka Jul 30 '12 at 15:56
  • Well I managed to do this with a csv file I created with a few hundred rows. Again and again, even lengthening the location path thinking it might be related. I would think it is your file or path. Sorry to not be more help – datatoo Jul 30 '12 at 16:40
  • omg, don't apologize))))) thank you for even trying. you're saying that you are able to run that code with no errors? the file i'm working with has 2M rows and 320 columns. i will try creating a fake small file and see what happens. – lalachka Jul 30 '12 at 19:48
  • yes, it IS my file, neither 4G or 320 columns are OK with access and i think that's why that code was written for VBS, not VBA. i'm going to try to repost it and see if anyone can help me fix that code. thank you soo much for all your help – lalachka Jul 30 '12 at 20:07
  • Would love to try with a fake file – datatoo Jul 31 '12 at 03:26
  • you mean for me to try with a fake file? i did and it worked, that's why i said that it IS my file that's the problem. sorry if i wasnt being clear)))) – lalachka Jul 31 '12 at 15:32
  • if that is the case, figuring out what is wrong is probably possible. Maybe the first line, or last. How is the file generated, and is this a frequently done task? It maybe possible to script the repair, if you can figure it out – datatoo Jul 31 '12 at 15:56
  • what do you mean? it's generated every month but i need it once in a blue. what do you mean by "first line or last" and script the repair – lalachka Jul 31 '12 at 16:07