I use R for data analysis and am very happy with it. Cleaning data could be a bit easier, however. I am thinking about learning another language suited to this task. Specifically, I am looking for a tool to use to take raw data, remove unnecessary variables or observations, and format it for easy loading in R. Contents would be mostly numeric and string data, as opposed to multi-line text.
I am considering the awk/sed combination versus Python. (I recognize that Perl would be another option, but, if I was going to learn another full language, Python seems to be a better, more extensible choice.)
The advantage of sed/awk is that it would be quicker to learn. The disadvantage is that this combination isn't as extensible as Python. Indeed, I might imagine some "mission creep" if I learned Python, which would be fine, but not my goal.
The other consideration that I had is applications to large data sets. As I understand it, awk/sed operate line-by-line, while Python would typically pull all the data into memory. This could be another advantage for sed/awk.
Are there other issues that I'm missing? Any advice that you can offer would be appreciated. (I included the R tag for R users to offer their cleaning recommendations.)