17

VBA is not cutting it for me anymore. I have lots of huge Excel files on which I need to perform lots of calculations and break them down into other Excel/CSV files.

I need a language that I can pick up within the next couple of days to do what I need, because it is kind of an emergency. Python has been suggested to me, but I would like to check with you whether there is anything else that handles CSV files quickly and easily.

Gaffi
  • 4,307
  • 8
  • 43
  • 73
Alex Gordon
  • 57,446
  • 287
  • 670
  • 1,062
  • 1
    Do you need to subset the data? If so, R is going to be much easier to manipulate it with than Python, as its subsetting functions are built into the base language. – chimeracoder Jul 26 '10 at 22:25
  • 1
    In my experience, saving your Excel data as tab-delimited rather than CSV saves a lot of headaches. Parsing becomes a piece of cake, since all you have to do is split the file by "new line" characters and split the rows by tab characters. There are no quotes or apostrophes to deal with. Tab-delimited is also the native format if you copy/paste data from an Excel sheet into a text editor. – devuxer Jul 27 '10 at 00:22
  • 1
    @DanM: Your experience appears to be extremely limited. Both Excel 2003 and Excel 2007 behave as follows when doing "save as Text (Tab delimited)": If a field contains any commas, tabs, or newlines, the field is wrapped in quotes. If a field contains any quotes, each quote is doubled and the whole field is then wrapped in quotes. In other words, very similar to CSV. – John Machin Jul 27 '10 at 01:31
  • @John Machin, alas, you're right. I haven't worked with delimited text files in a while, and I forgot that you need to use a workaround to avoid all the extra quoting: if you take a range of cells and paste them into a text editor, it won't quote anything unless you have tabs or carriage returns in your cells. The nice thing about tab delimiting is that you're a lot more likely to see commas in a spreadsheet than tab characters. – devuxer Jul 27 '10 at 01:51
  • 1
    @DanM: Why is that "nice"? **The ludicrous thing about tab delimiting is that Excel uses quoting needlessly when there are any commas in a field**. Select/copy/paste into a text editor??? Aarrgghh ... Holy multiple failure mode strategy, Batman! – John Machin Jul 27 '10 at 02:41
  • @John Machin, I didn't say the workaround was nice, just the results of the workaround :) It wouldn't take much VBA to write a macro that outputs "pure" tab-delimited text files, ones that never contain extra quotes (provided your document contains no tab characters). (And I agree with you that what Excel does when you save as "text tab-delimited" is ludicrous.) – devuxer Jul 27 '10 at 06:59

13 Answers

14

Python is an excellent choice. The csv module makes reading and writing CSV files easy (even Microsoft's, uh, "idiosyncratic" version) and Python syntax is a breeze to pick up.
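For instance, here is a rough sketch of reading one CSV file and writing a subset of its rows to another (the file names are just placeholders):

import csv

# Read every row; the default "excel" dialect copes with quoted fields
# that contain embedded commas.
with open("input.csv", newline="") as f:
    rows = list(csv.reader(f))

# Write the first 100 rows to a new file.
with open("subset.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows[:100])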

I'd actually recommend against Perl, if you're coming to it fresh. While Perl is certainly powerful and fast, it's often cryptic to the point of incomprehensible to the uninitiated.

Chris B.
  • 85,731
  • 25
  • 98
  • 139
  • I appreciate this answer, however after asking this: http://stackoverflow.com/questions/3339694/python-csv-getting-subset I am not convinced Python would be my first choice. R is probably better for me, but it will probably take too long to learn. – Alex Gordon Jul 27 '10 at 03:45
  • What's wrong with Python versus R? Alex's answer to your other question seems very simple and straight-forward. – pr1001 Jul 27 '10 at 08:42
  • R may take a long time to learn if you want to do advanced programming, but for manipulating .csv files, it couldn't be easier. read.csv() and write.csv() are the built-in functions you need. – chimeracoder Jul 27 '10 at 13:50
7

What kind of calculations do you have to do? Maybe R would be an alternative?

EDIT: just to give a few basic examples

# Basic usage
data <- read.csv("myfile.csv")

# Pipe-separated values
data <- read.csv("myfile.csv", sep="|")

# File with a header row (columns will be named from the header)
data <- read.csv("myfile.csv", header=TRUE)

# Skip the first 5 lines of the file
data <- read.csv("myfile.csv", skip=5)

# Read only 100 lines
data <- read.csv("myfile.csv", nrows=100)
nico
  • 50,859
  • 17
  • 87
  • 112
  • Nothing intensive; I just need to be able to work with large files, take logs, and create CSV files fast. – Alex Gordon Jul 26 '10 at 22:18
  • 1
    @I__: Then I would say R is an excellent choice. Reading a csv file is just a `read.csv` call away... – nico Jul 26 '10 at 22:35
  • I've never used the language myself, but looking it over for 10 minutes, it does indeed look like it would make dealing with csv's easy. – T.E.D. Jul 27 '10 at 13:12
5

Python definitely has a small learning curve, and works well with CSV files.

PaulMcG
  • 62,419
  • 16
  • 94
  • 130
John Howard
  • 61,037
  • 23
  • 50
  • 66
5

There are many tools for the job, but yes, Python is perhaps the best these days. It has a dedicated csv module in the standard library for dealing with CSV files. Check the official docs.
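As a rough sketch (the file and column names here are invented), the module's DictReader and DictWriter classes let you filter rows by named columns and write the result to a new file:

import csv

# Read rows as dictionaries keyed by the header row, keep the ones we
# want, and write them out with the same columns.
with open("input.csv", newline="") as src, \
     open("filtered.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        if row["Region"] == "EMEA":   # hypothetical column name and value
            writer.writerow(row)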

ssegvic
  • 3,123
  • 1
  • 20
  • 21
3

You know VBA? Why not Visual Basic 2008 / 2010, or perhaps C#? I'm sure languages like Python and Ruby would make the job somewhat easier, but you're already accustomed to the ".NET way" of doing things, so it makes sense to keep working with those instead of learning a whole new language just for this job.

Using C#:

// Requires: using System.IO; using System.Linq;
// Note: a plain Split(',') won't cope with quoted fields that contain commas;
// use a CSV library (see the links below) for anything non-trivial.
var csvlines = File.ReadAllLines("file.csv");

var query = from csvline in csvlines
            let data = csvline.Split(',')
            select new
            {
                ID = data[0],
                FirstName = data[1],
                LastName = data[2],
                Email = data[3]
            };
  1. .NET: Linq to CSV library.
  2. .NET: Read CSV with LINQ
  3. Python: Read CSV file
devuxer
  • 41,681
  • 47
  • 180
  • 292
rebelliard
  • 9,592
  • 6
  • 47
  • 80
  • Is VBA really that similar to .NET? Last time I used it, I found that felt quite different. – Rei Miyasaka Jul 26 '10 at 22:31
  • 3
    typical CSV files aren't as simple to parse as split(',') –  Jul 26 '10 at 22:35
  • 2
    I strongly recommend not doing this. The problem with CSV is that, on the face of it, it's so simple that everyone feels they should write their own parser. DON'T. You'll get caught up with the quoting of items or similar. People write libraries which are shared by hundreds of users; these are nearly always going to be more stable than something you write yourself. Having said that, I don't think VB is necessarily the wrong way to go about it if you know VBA. Be warned, though: VBA is much closer to VB Classic than VB.NET. – Wes Jul 26 '10 at 22:47
  • @DanM: **Tab-delimited is NOT a much simpler format.** You don't appear to consider that fields can contain commas, tabs, newlines and quotes. Try it some time and see what you get. – John Machin Jul 27 '10 at 01:34
3

You say you have "huge Excel files on which I need to perform lots of calculations and break them down into other Excel/CSV files", but all the answers so far talk about CSV only ...

Python has a csv read/write module, as others have mentioned. There are also the third-party modules xlrd (reads) and xlwt (writes) for XLS files. See the tutorial on this site.
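For example, a minimal sketch (file names invented, first sheet only) that dumps an XLS worksheet to CSV using xlrd together with the csv module:

import csv
import xlrd  # third-party package for reading .xls files

book = xlrd.open_workbook("report.xls")
sheet = book.sheet_by_index(0)

with open("report.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for rowx in range(sheet.nrows):
        writer.writerow(sheet.row_values(rowx))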

John Machin
  • 81,303
  • 11
  • 141
  • 189
  • R has support for Excel files too – nico Jul 26 '10 at 22:35
  • @nico: The R docs say "The first piece of advice is to avoid doing so if possible!" [why?] and then go on to list a miscellany of 3rd party packages ... doesn't sound like "has support" to me :) – John Machin Jul 26 '10 at 23:02
  • doing what? Using Excel for stats? I totally agree with that (see for instance yalta.etu.edu.tr/yalta-excel2007-statistical.pdf www.oci.uzh.ch/efiles/CHE103/chartjunk_excel_CSDA.pdf www.pages.drexel.edu/~bdm25/excel2007.pdf ) :D Anyways, R is strongly community driven, the fact that there are several packages that allow you to open Excel files/manipulate them and even execute R code from within Excel counts as "has support" for me! – nico Jul 27 '10 at 06:33
2

Perl is surprisingly efficient at text processing for a scripting language. cpan.org has a tremendous number of modules for dealing with CSV data. I've also read and written data in XLS format with other Perl modules. If you were able to use VBA, you can certainly learn Perl (the basics of Perl are easy, though it's just as easy for you or others to write terse yet cryptic code).

Hut8
  • 6,080
  • 4
  • 42
  • 59
  • Fully fledged CSV parsers such as Perl's `Text::CSV_XS` are purpose-built to handle weirdness such as commas within fields, wrapped by quotes. I provided sample code within my answer here: http://stackoverflow.com/questions/314384/parse-csv-file-using-gawk/33380831#33380831 – Chris Koknat Oct 28 '15 at 21:58
2

That depends on what you want to do with the files.

Python's learning curve is less steep than R's. However, R has a bunch of built-in functions that make it very well suited for manipulating .csv files easily, particularly for statistical purposes.

Edit: I'd recommend R over Python for this purpose alone, if only because the basic operations (reading files, dropping rows, dropping columns, etc.) are slightly faster to write in R than in Python.

chimeracoder
  • 20,648
  • 21
  • 60
  • 60
2

I'd give awk a try. If you're running windows, you can get awk via the cygwin utilities.

Dan Breslau
  • 11,472
  • 2
  • 35
  • 44
  • Dang. Beat my answer by seconds. However, since it is the same answer, you are clearly a genius as well. +1 :-) – T.E.D. Jul 26 '10 at 22:31
  • Heh. Apparently someone doesn't agree; I have a downvote on this. But thanks :-) – Dan Breslau Jul 26 '10 at 22:42
  • 1
    It wasn't me with the -1 but do please tell us how awk handles fields containing quotes, commas, carriage returns, and line feeds. – John Machin Jul 26 '10 at 23:05
  • 1
    @John Machin: No, awk does not solve CSV in the general case. The OP indicated that a lot of "calculations" need to be done on the file, which led me to believe that the data is mostly numeric. One can also write pre-processors to handle various kinds of special characters in the input. That may sound like overkill, but remember that the OP was asking for a small learning curve, and awk does have that (IMHO.) Here's such a pre-processor (that I didn't write:) http://stackoverflow.com/questions/314384/parse-csv-file-using-gawk/322812#322812 – Dan Breslau Jul 27 '10 at 02:10
  • @Dan Breslau: Ugh. That requires the caller to select as "repl" some character (default ~) that doesn't exist in the data. Yes, one can write such a pre-processor but it's a far cry from the supported and robust packages/modules that exist for languages like Python and Perl. – John Machin Jul 27 '10 at 02:37
2

This may not be anybody's popular language du jour, but since CSV files are line-oriented and split into fields, dealing with them is just about the perfect application for awk. It was built for processing line-oriented text data that can be split into fields.

Most of the other languages folks are going to recommend will be much more general-purpose, so there's going to be a lot more in them that isn't necessarily applicable to processing line-oriented text data.

T.E.D.
  • 44,016
  • 10
  • 73
  • 134
  • 1
    Unless awk has been upgraded recently (last 10 years!), it doesn't handle the intricate parts of CSV format, like commas and/or quotes in a field e.g. an address field contains `"DunRomin", Highway 987, Walla Walla, WA` -- you'll need to write or copy code to do it. – John Machin Jul 26 '10 at 22:43
  • @John Machin - Yes, it is true that you have to do a bit of work to get awk to support fields with commas in them. You are absolutely right that is a drawback. If your data might have such things, that's something you have to take account of. However, none of my CSV work has required that yet. If you are in control of what generates the input too, it is typically easily avoided. – T.E.D. Jul 27 '10 at 12:52
1

PowerShell has CSV import built in.

The syntax is ugly as death, but it's designed to be useful for administrators more than for programmers -- so who knows, you might like it.

It's supposed to be a quick get-up-and-go language, for better and worse.

Rei Miyasaka
  • 7,007
  • 6
  • 42
  • 69
0

I'm surprised nobody's suggested Power Query; it's perfect for consolidating and importing files into Excel, does column calculations nicely, and has a good graphical editor built in. It works for CSVs and Excel files, but also SQL databases and most other things you'd expect. I managed to get some basic cleaning and formatting up and running in a day, and it took maybe a few days to start writing my own functions (to break free from the GUI).

And since it only really does database stuff, it has barely any functions to learn (the actual language is called "M").

Greedo
  • 4,967
  • 2
  • 30
  • 78
-1

PHP has a couple of csv functions that are easy to use: http://www.php.net/manual-lookup.php?pattern=csv&lang=en

Mchl
  • 61,444
  • 9
  • 118
  • 120