I'm totally new to scripting - so new in fact that most of what I do script ends up being put in Mac's Automator as 'Run Shell Script.' So please forgive, well, everything.

Basically, I am building a web-corpus. So, I'm downloading .html files from the web, and using textutil to turn them into .txt files. I am then concatenating them as corpus.txt, and 'cleaning' corpus.txt with grep commands to remove things like lines beginning with a number, or ending in a number, or beginning with punctuation, etc.

The problem is that sometimes, the grepping isn't working. For instance, when I try

grep -v '^[0123456789]' corpus.txt > corpus2.txt

I still get some lines beginning with numbers in corpus2.txt. Similarly,

awk '!x[$0]++' corpus3.txt > deduped.txt

isn't removing lines that look like duplicates when I view them in TextEdit.

I believe this has something to do with the kind of newlines in the file. My reasoning is that what appears as a newline in TextEdit does not appear as one when I open the file in TextWrangler. Where the newline should be, there appears to be a space, followed by three invisible things that do not seem to have any width. I can't copy them into Terminal, so I can't just replace them with newlines, as far as I can tell.

I've tried saving the converted html files and the concatenated txt file in different UTF encodings with

find temp2 -type f -print0 | xargs -0 -P 4 textutil -convert txt -encoding UTF-32

but this hasn't helped. I can't even figure out how to find out what kinds of newlines appear in the text. Basically, my desired end result is that everything displayed as a newline in TextEdit is treated as a newline by grep, awk and sed. Is there a script that could perform this conversion? What exactly do I even need to convert?
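To illustrate the "what kinds of newlines" part, here is a minimal sketch of how the raw bytes could be inspected. The sample file is made up for the demonstration, and the idea that the three invisible things might be the three UTF-8 bytes of U+2028 LINE SEPARATOR (E2 80 A8) is only a guess, not something established above. On the real data, `corpus.txt` would take the place of `"$sample"`.

```shell
#!/bin/sh
# Hypothetical sample: an LF-terminated line, a CRLF-terminated line, and
# a line containing U+2028 (UTF-8 bytes E2 80 A8, octal 342 200 250).
sample=$(mktemp)
printf 'one\n2 two\r\nthree \342\200\2504 four\n' > "$sample"

# Dump every byte. In the C locale a carriage return prints as \r and
# the bytes of U+2028 print as the octal values 342 200 250.
LC_ALL=C od -c "$sample"

# Count the lines (as grep sees them) that contain a carriage return.
cr=$(printf '\r')
cr_lines=$(grep -c "$cr" "$sample")
echo "lines containing CR: $cr_lines"

# Count the lines that contain the byte sequence of U+2028.
ls_bytes=$(printf '\342\200\250')
ls_lines=$(LC_ALL=C grep -c "$ls_bytes" "$sample")
echo "lines containing U+2028: $ls_lines"
```

In this sample, grep treats `three`, the separator and `4 four` as a single line, because only LF ends a line for grep, awk and sed. That would be consistent with a number that looks like it starts a line surviving `grep -v '^[0123456789]'`.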

Sorry again for my ignorance. I'm a social sciences student, and am certainly not in Kansas anymore.

user2437842
  • If the line endings really are the problem, you can try replacing all carriage returns in the files with newlines. This might help: http://stackoverflow.com/questions/800030/remove-carriage-return-in-unix – abasu May 31 '13 at 05:52
  • If you have strange (non-visible) characters in a document you can remove them with "Text"->"Zap gremlins..." in TextWrangler. Choosing "Text"->"Normalize Line Endings" will also make sure that only one type of line ending is used in your document. – Vortexfive Jul 29 '13 at 14:33
