
How to remove lines with tab in them?

I have a file that looks like this:

0   absinth
Bohemian-style absinth
Bohemian-style or Czech-style absinth (also called anise-free absinthe, or just “absinth” without the “e”) is an ersatz version of the traditional spirit absinthe, though is more accurately described as a kind of wormwood bitters.
It is produced mainly in the Czech Republic, from which it gets its designations as “Bohemian” or “Czech,” although not all absinthe from the Czech Republic is Bohemian-style.

1   acidophilus milk
Sweet acidophilus milk is consumed by individuals who suffer from lactose intolerance or maldigestion, which occurs when enzymes (lactase) cannot break down lactose (milk sugar) in the intestine.
To aid digestion in those with lactose intolerance, milk with added bacterial cultures such as "Lactobacillus acidophilus" ("acidophilus milk") and bifidobacteria ("a/B milk") is available in some areas.
High Activity of Lactobacillus Acidophilus Milk

2   adobo
Adobo
Adobo (Spanish: marinade, sauce, or seasoning) is the immersion of raw food in a stock (or sauce) composed variously of paprika, oregano, salt, garlic, and vinegar to preserve and enhance its flavor.
In the Philippines, the name "adobo" was given by the Spanish colonists to an indigenous cooking method that also uses vinegar, which although superficially similar had developed independent of Spanish influence.

The desired output has the lines containing tabs removed, i.e.:

Bohemian-style absinth
Bohemian-style or Czech-style absinth (also called anise-free absinthe, or just “absinth” without the “e”) is an ersatz version of the traditional spirit absinthe, though is more accurately described as a kind of wormwood bitters.
It is produced mainly in the Czech Republic, from which it gets its designations as “Bohemian” or “Czech,” although not all absinthe from the Czech Republic is Bohemian-style.

Sweet acidophilus milk is consumed by individuals who suffer from lactose intolerance or maldigestion, which occurs when enzymes (lactase) cannot break down lactose (milk sugar) in the intestine.
To aid digestion in those with lactose intolerance, milk with added bacterial cultures such as "Lactobacillus acidophilus" ("acidophilus milk") and bifidobacteria ("a/B milk") is available in some areas.
High Activity of Lactobacillus Acidophilus Milk

Adobo
Adobo (Spanish: marinade, sauce, or seasoning) is the immersion of raw food in a stock (or sauce) composed variously of paprika, oregano, salt, garlic, and vinegar to preserve and enhance its flavor.
In the Philippines, the name "adobo" was given by the Spanish colonists to an indigenous cooking method that also uses vinegar, which although superficially similar had developed independent of Spanish influence.

I could do the following in python to achieve the same results:

with open('file.txt', 'r') as fin, open('file2.txt', 'w') as fout:
  for line in fin:
    if '\t' in line:
      continue
    else:
      fout.write(line)

But I have millions of lines and it's not that efficient. So I tried this: use cut to keep only the first tab-separated field, then awk to drop the resulting single-character lines:

$ cut -f1 WIKI_WN_food | awk 'length>1' | less

What is a more pythonic way to get the desired output?

Is there a more efficient way than the cut + awk piping I've shown above?

alvas

6 Answers

2

Your code is OK; you could try to optimize by looking only at the beginning of the string:

if '\t' not in l[:5]: fout.write(l)

where the length of the substring depends on the maximum record number. It could make a difference with long strings that don't match, who knows...
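For what it's worth, here is how that substring check drops into the OP's loop, wrapped as a function for reuse (the prefix_len=5 default is an assumption about how many digits the record numbers can have, not something from the question):

```python
def strip_tabbed_lines(fin, fout, prefix_len=5):
    """Copy lines from fin to fout, skipping any line whose first
    prefix_len characters contain a tab; the numeric record prefix
    plus its tab fits well within that window."""
    for line in fin:
        if '\t' not in line[:prefix_len]:
            fout.write(line)
```

Opening the input and output files and passing them in works exactly as in the OP's original snippet.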

Further, you may want to test mawk, grep etc as in

# Edit : the following won't work. it strips also blank lines
# mawk -F"\t" "NF==1"  original > stripped
grep -vF "\t"        original > stripped
sed -e "/\t/d"       original > stripped

to see if it's faster than a python solution.

Testing

On my system, with a 1,418,973,184-byte file obtained by repeatedly duplicating yours, I get approximate times as follows: grep 1.6s, sed 6.4s, python 4.6s. The python run time does not depend measurably on whether it searches the whole string or only a substring.

Addendum

I tested Jidder's awk solution (given in a comment on the question) using mawk; my approximate timing is 3.2s. So, for what it's worth... the winner is grep -vF.
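Note that plain grep does not expand \t inside quotes; in the transcript below a literal tab was typed into the pattern. A copy/paste-safe way to pass a real tab (my addition, not part of the timed runs) is command substitution:

```shell
# Build a tiny sample, then embed a real tab in the pattern via
# printf, so the command survives copy/paste without losing it.
printf '0\tabsinth\nBohemian-style absinth\n' > original
grep -vF "$(printf '\t')" original > stripped
# stripped now holds only the tab-free line
```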

Testing transcript

The run times vary by a couple of tenths of a second between executions; here I report only one run per command, so for close results one can't make a clear call.

On the other hand, the different tools gave results far enough apart from the experimental error that I think we can draw some conclusions...

% ls -l original 
-rw-r--r-- 1 boffi boffi 1418973184 Dec  8 21:33 original
% cat doit.py
from sys import stdout
with open('original', 'r') as fin:
  for line in fin:
    if '\t' in line: continue
    else: stdout.write(line)
% time wc -l original 
15731133 original

real    0m0.407s
user    0m0.184s
sys     0m0.220s
% time python doit.py | wc -l
12584034

real    0m5.334s
user    0m4.880s
sys     0m1.428s
% time grep -vF "       "  original | wc -l
12584035

real    0m1.527s
user    0m1.112s
sys     0m1.400s
% time grep -v "        "  original | wc -l
12584035

real    0m1.556s
user    0m1.120s
sys     0m1.436s
% time sed -e "/\t/d"  original | wc -l
12584034

real    0m6.481s
user    0m6.104s
sys     0m1.404s
% time mawk '!/\t/'  original | wc -l
12584035

real    0m3.059s
user    0m2.608s
sys     0m1.488s
% time gawk '!/\t/'  original | wc -l
12584035

real    0m9.148s
user    0m8.680s
sys     0m1.468s
% 

My example file has a truncated last line, hence the off-by-one difference in line counts between python and sed on one side and all the other tools on the other.
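That off-by-one is easy to reproduce: wc -l counts newline characters, so an unterminated final line is invisible to it, while the filters still emit it. A minimal illustration:

```shell
# Two lines of text, but the second has no trailing newline,
# so wc -l counts only one '\n'.
printf 'one\ntwo' | wc -l     # prints 1, not 2
```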

gboffi
  • What happened to mawk ? Use the command in my comment on the question instead :) –  Dec 08 '14 at 21:02
  • My solution was broken, I have tested yours and I was updating my testing section when I saw your comment... stay tuned – gboffi Dec 08 '14 at 21:04
  • 1
    Years ago I was led to think that `mawk` is faster than `gawk`... many years later, is this still true? Well, testing Jidder's solution with gawk gives me 9.3s. – gboffi Dec 08 '14 at 21:10
  • Why are you using the `-F` flag for grep in this case - does it improve speed of execution or something else? – Ed Morton Dec 08 '14 at 21:21
  • I've been possibly misled by the man page; testing says there's no measurable difference between the two versions. – gboffi Dec 08 '14 at 21:24
  • Is `mawk -F"\t" 'NF < 2' original` any faster than Jiddler's comment? It seems to me that calling `'!/\t/'` is effectively doing double work as mawk has to parse the whole line according to `FS` prior to running pattern tests. – n0741337 Dec 08 '14 at 22:40
  • @n0741337 it always parses the whole line, there is no way not to; yours will also search for `\t`, as it looks for it to separate fields. Even if it did work the way you say, it would still be quicker my way, as it would short-circuit after finding the first tab, whereas yours will look for more tabs to separate fields on. TBH though I doubt it would make any difference to the speed either way. –  Dec 08 '14 at 22:45
  • @Jidder - I agree that you can't escape the cost of mawk parsing every line by the `FS`. I'm curious if restricting the standard parse from any white-space to just tabs and then using the `NF` field resulting from that instead of standard field parsing and then reloading the `$0` field to retest each line for no tabs specifically is faster. Even if your pattern short-circuits, a pattern that re-examines `$0` seems slower to me. No offense meant, I think your solution was particularly clever, but I'm interested in speed differences on large files and gboffi already has a test suite set up. – n0741337 Dec 08 '14 at 23:29
  • @n0741337 using anything other than the default FS should be noticeably slower for splitting the record into fields than using the default FS since awk is optimized for running with it's default settings. I do wonder though if the numeric comparison of `NF<2` would be enough faster than the RE comparison of `!/\t/` to make the overall time faster. I've got to believe that awk is highly optimized for RE comparisons though since that's it's bread and butter job so my money would be on just using the default FS and testing for `!/\t/` being faster. I also wonder about `/\t/{next}` :-). – Ed Morton Dec 09 '14 at 00:12
  • 1
    @EdMorton - I had time waiting for other data loads, so I made a ~1.7mil line file to test with and it appears (on an Ubuntu guest VM) using my command is ~17-20% slower than Jiddler's. The difference is even more noticeable using `awk`/`nawk`/`gawk` - like 4x worse :) The defaults *are* magic. Good to know. – n0741337 Dec 09 '14 at 00:53
1

You can do this with sed

sed '/\t/d' 'my_file'

It looks for "\t" and deletes the lines that contain it.

repzero
0
grep -v '\t' file


Ed Morton
  • Would be much better if you added a comment on what this means – Emil Vikström Dec 08 '14 at 20:52
  • 1
    No, it would be worse. The command couldn't be any simpler or more basic and anyone who doesn't know what it means can figure it out in about 30 seconds flat with a much-needed trip to the grep man page so they get the benefit of an answer and info on how to figure out answers for themselves in future. – Ed Morton Dec 08 '14 at 21:11
0

Try using grep with a Perl-style regular expression:

grep -vP "\t" file.in > file.out
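With -P the \t escape is interpreted by grep itself, so no literal tab has to be typed into the pattern (GNU grep only; the throwaway file names below are just for illustration):

```shell
# One tabbed line, one plain line; -vP "\t" keeps only the latter.
printf 'a\tb\nno tabs here\n' > file.in
grep -vP "\t" file.in > file.out
```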
0

Try whether using filter gives you an advantage:

with open('file.txt', 'r') as fin, open('file2.txt', 'w') as fout:
    fout.write(''.join(filter(lambda l: '\t' not in l, fin.readlines())))

Note that the condition must test for '\t' (an actual tab character), not the raw string r'\t', which is a backslash followed by a t. Test whether the condition works with your file; you may need to test for a run of spaces instead of a tab. I had to hand-code the \t into my file.txt for the code to work. That is why I then tried regex substitution instead:

import re

with open('file.txt', 'r') as fin, open('file2.txt', 'w') as fout:
    fout.write(re.sub(r'^\d+\s{2,}[^\n]+', '', fin.read(), count=0, flags=re.M))

Except now I get an empty line in place of each line you want to eliminate.

GOT IT: the regex needs a \n at the end to work:

    fout.write(re.sub(r'^\d+\s{2,}[^\n]+\n', '', fin.read(), count=0, flags=re.M))
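If the separator really is a tab character, as in the question, a pattern keyed on \t is more direct than matching the digit-plus-spaces prefix. A quick check (the sample text here is my own, not the question's file):

```python
import re

text = ('0\tabsinth\nBohemian-style absinth\n'
        '1\tacidophilus milk\nSweet acidophilus milk\n')

# Delete every line that contains a tab, trailing newline included,
# using re.M so ^ anchors at each line start.
cleaned = re.sub(r'^[^\n]*\t[^\n]*\n', '', text, flags=re.M)
print(cleaned)
```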
chapelo
-1

You can try it with tr:

tr -d " \t" < tabbed-file.txt > sanitized-file.txt

man tr

tr - translate or delete characters

--

You can also try it with sed. To remove all leading whitespace, including tabs, from the start of the line to the first word, issue:

echo " This is a test" | sed -e 's/^[ \t]*//'

unixmiah
  • 1
    This just deletes the tab characters themselves. – Etan Reisner Dec 08 '14 at 20:09
  • 1
    The second one won't work because there are characters at the start of the line. Maybe `sed 's/.*\t.*//'` would work better, but Etan Reisner's comment on the question is the best way to do it with sed. –  Dec 08 '14 at 20:23