
I use R for data analysis and am very happy with it. Cleaning data could be a bit easier, however. I am thinking about learning another language suited to this task. Specifically, I am looking for a tool to use to take raw data, remove unnecessary variables or observations, and format it for easy loading in R. Contents would be mostly numeric and string data, as opposed to multi-line text.
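To make the task concrete, here is a minimal Python sketch of the kind of pre-processing I have in mind (the column names and the filtering rule are invented for illustration):

```python
import csv
import io

def clean(lines, drop=("notes",), require="id"):
    """Stream records from raw CSV lines, dropping unneeded
    variables and skipping observations missing a key field."""
    for row in csv.DictReader(lines):
        if not row.get(require):      # reject incomplete observations
            continue
        yield {k: v for k, v in row.items() if k not in drop}

# Tiny inline sample standing in for a raw data file
raw = io.StringIO("id,value,notes\n1,3.5,ok\n,9.9,bad\n2,7.1,fine\n")
rows = list(clean(raw))
# rows keeps only complete observations, without the "notes" column
```

The cleaned rows could then be written back out with `csv.DictWriter` and loaded in R with `read.csv`.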

I am considering the awk/sed combination versus Python. (I recognize that Perl would be another option, but, if I were going to learn another full language, Python seems the better, more extensible choice.)

The advantage of sed/awk is that it would be quicker to learn. The disadvantage is that this combination isn't as extensible as Python. Indeed, I might imagine some "mission creep" if I learned Python, which would be fine, but not my goal.

The other consideration that I had is applications to large data sets. As I understand it, awk/sed operate line-by-line, while Python would typically pull all the data into memory. This could be another advantage for sed/awk.

Are there other issues that I'm missing? Any advice that you can offer would be appreciated. (I included the R tag for R users to offer their cleaning recommendations.)

zx8754
Charlie
  • By "cleaning", do you mean clipping outliers, restoring consistency, or something else? By "data", do you mean mostly numbers and strings, or free-form text? To me the target of the current question is too general. – nye17 Sep 20 '11 at 03:25
  • @nye17, sorry for the ambiguity. I added a bit more detail. – Charlie Sep 20 '11 at 03:31
  • I use primarily Python for myself, but if it were purely manipulation of a text-based data set, serving as a data interface for R, I would strongly suggest Perl, given its powerful regular expressions and flexibility in dealing with text. – nye17 Sep 20 '11 at 03:37
  • you might find the example of combining R and perl presented [here](http://community.moertel.com/~thor/talks/pgh-pm-perl-and-r.pdf) interesting. – nye17 Sep 20 '11 at 03:46
  • I want to know what we can do with perl/python/ruby/sed/awk etc. but cannot do with R. – kohske Sep 20 '11 at 04:19
  • Python would typically **not** pull all of the data into memory unless you explicitly do so. – donkopotamus Sep 20 '11 at 04:45
  • @kohske It's not so much what can or can't be done, but how easily it's done. Each of them is strong and each is weak for a set of use cases. For example, R is great for interactive data manipulation, but I would not use it to, say, build a large scale data integration and filtering pipeline... but it could be made to do so. – Reece Sep 20 '11 at 16:38

6 Answers


Not to spoil your adventure, but I'd say no and here is why:

  • R is vectorised where sed/awk are not
  • R already has both Perl regular expressions and extended regular expressions
  • R can more easily have recourse to statistical routines (say, imputation) if you need them
  • R can visualize, summarize, ...

and most importantly: you already know R.

That said, sed/awk are of course great for small programs or even one-liners, and Python is a fine language. But I would consider sticking with R.

Dirk Eddelbuettel

I use Python and Perl regularly. I know sed fairly well and once used awk a lot. I've used R in fits and spurts. Perl is the best of the bunch for data transformation functionality and speed.

  • Perl can do essentially everything sed and awk can do, but lots more as well. (In fact, a2p and s2p, which come with perl, convert awk and sed scripts to Perl.)
  • Perl is included with most Linux/Unix systems. When that wasn't the case, there was good reason to learn sed and awk. That reason is long dead.
  • Perl has a rich set of modules that provide much more power than one can get from awk or sed. For example, these modules enable one-liners that reverse complement DNA sequences, compute statistics, parse CSV files, or calculate MD5s. (see http://cpan.org/ for packages)
  • Perl is essentially as terse as sed and awk. For people like me (and, I suspect, you), quickly transforming data on the command line is a great boon. Python's too wordy for efficient command line use.
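For comparison only, a hedged sketch of how a few of the tasks mentioned above (CSV parsing, MD5s, reverse-complementing DNA) look in Python's standard library; as noted, this is wordier than the equivalent Perl one-liners:

```python
import csv
import hashlib
import io

# Parse a small CSV fragment (stand-in data)
rows = list(csv.reader(io.StringIO("a,b\n1,2\n")))

# MD5 digest of a byte string, via the hashlib module
digest = hashlib.md5(b"hello").hexdigest()

# Reverse-complement a DNA string: complement each base, then reverse
rc = "AACG".translate(str.maketrans("ACGT", "TGCA"))[::-1]
```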

I'm honestly at a loss to think why one would learn sed and awk over Perl.

For the record, I'm not "a Perl guy". I like it as a Swiss Army knife, not as a religion.

Reece
  • +1 for a thorough comparison from a fair coding background. – nye17 Sep 20 '11 at 04:20
  • +1 for Perl. Although Python might be more readable, Perl beats it any time on speed and compactness. And the command line options are indeed a blessing. – Joris Meys Sep 20 '11 at 11:28
  • The 'every Unix system has Perl' argument applies even more so to sed and awk, and these two are easier to get hold of if you need (shudder) to work on Windoze. And that gets us back to my 'just use R', as Charlie would clearly have R on Windows. I used to write lots of data filters in Perl but switched entirely to R. – Dirk Eddelbuettel Sep 20 '11 at 15:15
  • Generally a good answer, but the question did list simplicity as a consideration. While Perl is unquestionably more powerful, if sed/awk does all that he needs, that could be a reason why "one would learn sed and awk over Perl." – user287424 May 07 '13 at 20:45

I would recommend sed/awk along with the wealth of other command-line tools available on UNIX-alike platforms: comm, tr, sort, cut, join, grep, and built-in shell capabilities like looping and whatnot. You really don't need to learn another programming language, as R can handle data manipulation as well as, if not better than, the other popular scripting languages.

Jeff
  • Jeff's got a good point: when glued together with pipes, command line tools like the ones he mentions enable very fast and powerful slicing and dicing of data. Perl complements (rather than supplants) many of these tools. See the GNU coreutils manual at http://www.gnu.org/s/coreutils/manual/html_node/index.html for a summary. – Reece Sep 20 '11 at 16:23
  • And better still, R can play with the same pipes if you use the fabulous `r` binary from the littler package by Jeff and Dirk. So back to using R :) – Dirk Eddelbuettel Sep 20 '11 at 16:24
  • The asker didn't specify a platform, but this may not be such a good approach for the majority of the population who are on Windows. The transfer of Unix tools to the Windows environment has not been without problems. – user287424 May 07 '13 at 20:55

I would recommend investing for the long term in a proper language for processing data files, like Python or Perl or Ruby, versus the short-term sed/awk solution. I think that all data analysts need at least three languages; I use C for hefty computations, Perl for processing data files, and R for interactive analysis and graphics.

I learned Perl before Python had become popular. I've heard great things about Ruby, so you might want to try that instead.

For any of these you can work with files line-by-line; Python doesn't need to read the full file in advance.
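To illustrate the line-by-line point: Python file objects, and generators over them, are lazy, so a filter holds only one record in memory at a time. A minimal sketch, with invented sample data standing in for a file:

```python
def keep_valid(lines):
    """Yield non-blank, non-comment lines one at a time;
    nothing is accumulated, so memory use stays constant."""
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            yield line

# Stand-in for iterating over an open file, e.g. open("raw.dat")
sample = ["# header\n", "1 2 3\n", "\n", "4 5 6\n"]
filtered = list(keep_valid(sample))
```

The same generator works unchanged on an arbitrarily large file, since a file object yields one line per iteration.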

Karl
  • Sure, with the caveat that 'C++ may be a better C than C' and, similarly, Python fans argue that it is better than Perl. But as a general rule, knowing 'R, *a* scripting language, and a *modern portable compiled language*' is a good recipe. – Dirk Eddelbuettel Sep 20 '11 at 15:19
  • @DirkEddelbuettel Indeed, I'm stuck in the late 90s, programming-wise; I fear that students will view me the way that I view Fortran programmers. – Karl Sep 20 '11 at 17:29

I would recommend 'awk' for this type of processing.

Presumably you are just searching/rejecting invalid observations in simple text files.

awk is lightning fast at this task and is very simple to program.

If you need to do anything more complex, then you can.

Python is also a possibility if you don't mind the performance hit. The "rpy" library can be used to closely integrate the Python and R components.

James Anderson

I agree with Dirk. I thought about the same thing and used other languages a bit, too. But in the end I was surprised again and again by what more experienced users do with R. Packages like plyr (and its ddply function) might be very interesting to you. That being said, SQL has often helped me with data juggling.

Matt Bannert