Let me first say that I assiduously avoid hand-cleaning data in favor of regular expressions and the like. However, occasionally it is inevitable.
I use something like the Load-Clean-Func-Do workflow normally, so this obviously fits into the cleaning phase. However, any hand-editing breaks the ability to run the stuff before the hand-cleaning if it needs updating.
I can think of at least three ways to handle this:
- Put the by-hand changes as early in the workflow as possible, so that everything after that remains runnable.
- Write out regexes or assignment operations for every single change.
- Use a tool that generates (2) for you after you close the spreadsheet where you've made the changes.
The problem with 2 is that it can be extremely unweildy. The problem with 3 is that I'm unaware of any such tool existing for R. Stata has an extremely good implementation of this.
So the questions are:
- Which results in the most replicable code with the least-frustrating code writing?
- Does a tool as in (3) exist?