I have a data.anno file, composed of 6677 rows and 33 columns. As an example, in the first image you can see some of the rows of the data.anno file.

2953 of the rows contain "present" in the 10th column. I want to obtain a new file like the original, but without the rows that contain "present" in the 10th column. I've tried with this:

awk '$10!="present"' data.anno >> data_output.anno

but I encountered a problem: the output file still contains two rows with "present" in the 10th column, while the other 2951 rows containing "present" in the 10th column have correctly disappeared. Do you have any idea why this happens? Is there a better way to obtain the output file I need?

In the second image you can see the two rows containing "present" that are still present in the output file after using awk. In the third image you can see some of the 2951 rows containing "present" that have correctly disappeared after using awk.

[Image 1: some of the rows of the data.anno file]

[Image 2: rows containing "present" that are still present in the output file after using awk]

[Image 3: some of the 2951 rows containing "present" that have correctly disappeared after using awk]

Nuthatch92
  • I believe your 3rd field might have some whitespace or a hidden character. Try `awk '$3~\Italy\{next}'`. This should print only if your line doesn't have Italy in the third column. Check also for uppercase/lowercase differences. – Daemon Painter Apr 30 '20 at 08:00
  • @DaemonPainter ITYM `/Italy/`, not `\Italy\ `. Nuthatch - the script you posted is correct for the input you posted so your real input must not look like the input you posted and so there's not much we can do to help you with that. Good luck! – Ed Morton Apr 30 '20 at 12:50
  • @EdMorton, correct, thanks for spotting it – Daemon Painter Apr 30 '20 at 13:08
  • @DaemonPainter, sorry, but the original file comprises more than 6,000 rows, so all I can do is post some of the rows in the question, and I'm not sure this will help you solve the problem. The other thing I can do is give you the link to the original file, which is publicly accessible. It is the .anno file with 1240K in the Description. Here is the link: [https://reich.hms.harvard.edu/downloadable-genotypes-present-day-and-ancient-dna-data-compiled-published-papers]. Please let me know if it's better to post just some rows of the original file as an example. – Nuthatch92 Apr 30 '20 at 13:22
  • No-one expects/wants you to post your full, original file. We just need you to create and post a [mcve] that contains a truly representative example of that original file. No images, no links, just a small text file of, say, 3-5 rows and 3-5 columns that has the same look/characters in fields, same placement of the fields you care about (start/middle/end of rows) and the same separators and line endings, plus the output you expect given that sample input file. See [ask] if that's not clear. – Ed Morton Apr 30 '20 at 13:36
  • Having said that, I did check the data in the file from the link you provided (actually https://reichdata.hms.harvard.edu/pub/datasets/amh_repo/curated_releases/V42/V42.4/SHARE/public.dir/v42.4.1240K.anno - not sure why you had us go to a different page first and have to look for that!) and it does not contain any white space around Italy in any row of that file as @Daemon had suspected it might so that is not the issue. – Ed Morton Apr 30 '20 at 13:55
  • I see you added an image. That's not useful as we can't copy/paste an image to examine characters and test against. As [I mentioned earlier](https://stackoverflow.com/questions/61506069/problems-using-awk-to-delete-a-row-with-a-specific-value-at-a-certain-column/61525105#comment108832206_61506069): "No images, no links, just a small text file...". Again, see [ask]. – Ed Morton Apr 30 '20 at 14:06
  • @EdMorton great research job! Nevertheless, I've had trouble on Windows when my file was saved via Notepad++ with a line terminator that is not "liked" by the specific (g)awk build. So, for me, testing against a regex instead of an exact match can be a good strategy to try. – Daemon Painter Apr 30 '20 at 14:12
  • @DaemonPainter I didn't say anything about using regexps or not, I was just eliminating the possibility that white space around the country name, which was a good guess at what the problem might be, was actually the problem, so we can focus on other possible issues. The trouble you're experiencing with files created on Windows is probably the one I covered at https://stackoverflow.com/q/45772525/1745001. – Ed Morton Apr 30 '20 at 14:15

1 Answer

Your real input file, which has the countries in the 13th column, is tab-separated and has some fields that contain blanks so you need to set FS to tab:

awk -F'\t' '$13 != "Italy"' file

Otherwise, awk splits on any run of blanks or tabs, so rows that have fields before $13 containing blanks will be split into extra fields, and Italy won't be in the 13th field; it'll be in the 14th or later.
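
One way to check for this kind of mis-splitting is to compare the per-row field counts under the default separator and under a tab separator. The following is a minimal sketch on a made-up two-row sample (the temporary file and its contents are assumptions for illustration, not the real .anno data):

```shell
# Hypothetical two-row, tab-separated sample; the DAY field of row 1 contains a blank.
sample=$(mktemp)
printf '1\tthe weekend\tItaly\tstuff\n2\tmon\tEngland\tstuff\n' > "$sample"

# Default FS: any run of blanks or tabs separates fields,
# so "the weekend" counts as two fields and row 1 gets one field too many.
default_counts=$(awk '{ printf "%s%s", sep, NF; sep = " " } END { print "" }' "$sample")

# FS set to tab: a blank inside a field no longer splits it.
tab_counts=$(awk -F'\t' '{ printf "%s%s", sep, NF; sep = " " } END { print "" }' "$sample")

echo "default FS field counts: $default_counts"
echo "tab FS field counts:     $tab_counts"

rm -f "$sample"
```

If the counts differ from row to row under the default separator but are uniform with `-F'\t'`, blanks inside fields are the likely cause.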

Here's what's happening using a more truly representative sample input file that has tab-separated fields (the cat -T is just to make the tabs visible):

$ cat file
ID      DAY     LOCALITY        OTHER
1       the weekend     Italy   stuff
2       mon     England stuff
3       wed     Italy   stuff
4       the weekend     Italy   stuff
5       sun     England stuff
6       thu     Italy   stuff

$ cat -T file
ID^IDAY^ILOCALITY^IOTHER
1^Ithe weekend^IItaly^Istuff
2^Imon^IEngland^Istuff
3^Iwed^IItaly^Istuff
4^Ithe weekend^IItaly^Istuff
5^Isun^IEngland^Istuff
6^Ithu^IItaly^Istuff

$ awk '$3!="Italy"' file
ID      DAY     LOCALITY        OTHER
1       the weekend     Italy   stuff
2       mon     England stuff
4       the weekend     Italy   stuff
5       sun     England stuff

$ awk -F'\t' '$3!="Italy"' file
ID      DAY     LOCALITY        OTHER
2       mon     England stuff
5       sun     England stuff
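
To write the filtered rows to a new file and confirm nothing slipped through, something along these lines may help. It is only a sketch on a made-up sample (the temporary files and their contents are assumptions); it uses `>` rather than `>>` because `>>` appends, so rerunning the command would add a second copy of the rows to an existing output file:

```shell
# Hypothetical tab-separated sample and output file, for illustration only.
sample=$(mktemp)
out=$(mktemp)
printf 'ID\tDAY\tLOCALITY\tOTHER\n1\tthe weekend\tItaly\tstuff\n2\tmon\tEngland\tstuff\n3\twed\tItaly\tstuff\n' > "$sample"

# Keep only rows whose 3rd tab-separated field is not exactly "Italy";
# > truncates the output file first, so reruns do not accumulate rows.
awk -F'\t' '$3 != "Italy"' "$sample" > "$out"

# Count rows in the output that still have "Italy" in field 3 (should be 0).
remaining=$(awk -F'\t' '$3 == "Italy" { n++ } END { print n + 0 }' "$out")

# Count rows that were kept (header plus the England row).
kept=$(awk 'END { print NR }' "$out")

echo "rows still matching: $remaining"
echo "rows kept:           $kept"

rm -f "$sample" "$out"
```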
Ed Morton