how can I take some info out of text

Question

I have a text file like this

sp|O15304|SIVA_HUMAN MPKRSCPFADVAPLQLKVRVSQRELSRGVCAERYSQEVFEKTKRLLFLGAQAYLDHVWDEGCAVVHLPESPKPGPTGAPRAARGQMLIGPDGRLIRSLGQASEADPSGVASIACSSCVRAVDGKAVCGQCERALCGQCVRTCWGCGSVACTLCGLVDCSDMYEKVLCTSCAMFET IGPDGR
tr|A0A1B1L9R9|A0A1B1L9R9_BACTU MNKQLFLASLKETQKSILSYACGAALYLWLLIWIFPSMVSAKGLNELIAAMPDSVKKIVGMESPIQNVMDFLAGEYYSLLFIIILTIFCVTVATHLIARHVDKGAMAYLLATPVSRVQIAITQATVLILGLLIIVSVTYVAGLVGAEWFLQDNNLNKELFLKINIVGGLIFLVVSAYSFFFSCICNDERKALSYSASLTILFFVLDMVGKLSDKLEWMKNLSLFTLFRPKEIAEGAYNIWPVSIGLIAGALCIFIVAIVVFKKRDLPL NKELFLKINIVGGLIFLVVSAYSFFFSCICNDERKALSYSASLTILFFVLDMVGKLSDKLEWM

I am trying to lower case the third part of the two set. I tried the following, but it does not work:

awk '{ gsub($3, tolower($3)); print $1"\t"$2}'

I have a Mac. Is there any other way to do that?

Can you show what your desired output is? Quite a few people posting different answers because "I am trying to lower case the third part of the two set" is not super clear. Thanks for attempting a solution though! — Ian McGowan, Dec 20 '18 at 20:03
@Ian McGowan I gave the output, please see the section where I wrote `the output looks like this` — Learner, Dec 20 '18 at 20:05
You are not helping here ;-) How about this? Given a string with 3 fields "FIRST MPKIGPDGRLIR IGPDGR" lower case the part of the 2nd field that matches the 3rd, so the output is "FIRST MPKigpdgrLIR"? — Ian McGowan, Dec 20 '18 at 20:17
@Ian McGowan I made it very clear. I hope it is easy to work with now? — Learner, Dec 20 '18 at 20:26
Added another answer, now the question is a little easier to understand. It's not that your question is difficult, it's that you're doing a horrible job of explaining it ;-) It would also help to simplify to the essentials - those giant strings aren't helping. It seems like the major question is how to pipe the output from one command to another. Using the "|" symbol is the answer to that question. You also should be aware that the gsub function in awk works on the whole string, so if your $3 matches with anything in $1 it will be replaced there too. Try setting $3 to A to see that. — Ian McGowan, Dec 20 '18 at 20:34
You did not make it very clear, friend. "third part of the two set" doesn't parse well in English. Your best bet is to manually edit this sample so that you can show what it looks like both *before* and *after* processing, so we can see exactly what change you need. — Paul Hodges, Dec 21 '18 at 14:50

Ian McGowan · Answer 1 · 2018-12-20T20:09:01.247

You're splitting on the default awk delimiter to get $1 and $2. Then you need to split $1 on "|" and lowercase the 3rd part of $1?

$ awk '{split($1,a,"|") ; print a[1] "|" a[2] "|" tolower(a[3]) "\t" $2 "\t" $3}' test.txt

sp|O15304|siva_human    MPKRSCPFADVAPLQLKVRVSQRELSRGVCAERYSQEVFEKTKRLLFLGAQAYLDHVWDEGCAVVHLPESPKPGPTGAPRAARGQMLIGPDGRLIRSLGQASEADPSGVASIACSSCVRAVDGKAVCGQCERALCGQCVRTCWGCGSVACTLCGLVDCSDMYEKVLCTSCAMFET
tr|A0A1B1L9R9|a0a1b1l9r9_bactu  MNKQLFLASLKETQKSILSYACGAALYLWLLIWIFPSMVSAKGLNELIAAMPDSVKKIVGMESPIQNVMDFLAGEYYSLLFIIILTIFCVTVATHLIARHVDKGAMAYLLATPVSRVQIAITQATVLILGLLIIVSVTYVAGLVGAEWFLQDNNLNKELFLKINIVGGLIFLVVSAYSFFFSCICNDERKALSYSASLTILFFVLDMVGKLSDKLEWMKNLSLFTLFRPKEIAEGAYNIWPVSIGLIAGALCIFIVAIVVFKKRDLPL

edited Dec 20 '18 at 20:09

answered Dec 20 '18 at 19:53

Nice answer using awk! But I think OP wants to split on whitespace if I understood it correctly. Either way, why is the `IGPDGR` from the end of the line lost? – lucidbrot Dec 20 '18 at 19:58
Nice catch! I didn't notice the 3rd field on the end of the example data. This would be a better question if a) the example was trimmed down (it doesn't matter to us what these strings represent) and b) the desired output was shown. I'll edit. – Ian McGowan Dec 20 '18 at 20:05

Paul Hodges · Answer 2 · 2018-12-21T14:26:49.987

Use a read into a variable declared as lowercase.

In all these examples I am printing the sections wrapped in square brackets ([]) so you can see how it's parsing, and I'm just putting spaces between. You can edit all that. The important part is to understand what defines the separations and to get the right part into the variable that will lowercase it.

declare -l three
while IFS='|' read -r one two three
do echo "[$one] [$two] [$three]"
done < infile
[sp] [O15304] [siva_human mpkrscpfadvaplqlkvrvsqrelsrgvcaerysqevfektkrllflgaqayldhvwdegcavvhlpespkpgptgapraargqmligpdgrlirslgqaseadpsgvasiacsscvravdgkavcgqceralcgqcvrtcwgcgsvactlcglvdcsdmyekvlctscamfet igpdgr]
[tr] [A0A1B1L9R9] [a0a1b1l9r9_bactu mnkqlflaslketqksilsyacgaalylwlliwifpsmvsakglneliaampdsvkkivgmespiqnvmdflageyysllfiiiltifcvtvathliarhvdkgamayllatpvsrvqiaitqatvlilglliivsvtyvaglvgaewflqdnnlnkelflkinivggliflvvsaysfffscicnderkalsysasltilffvldmvgklsdklewmknlslftlfrpkeiaegayniwpvsigliagalcifivaivvfkkrdlpl nkelflkinivggliflvvsaysfffscicnderkalsysasltilffvldmvgklsdklewm]

If you only want the part after the last pipe but before the first space, and the format is consistent, add a space to IFS:

declare -l three
while IFS='| ' read -r one two three four
do echo "[$one] [$two] [$three] [$four]"
done < infile
[sp] [O15304] [siva_human] [MPKRSCPFADVAPLQLKVRVSQRELSRGVCAERYSQEVFEKTKRLLFLGAQAYLDHVWDEGCAVVHLPESPKPGPTGAPRAARGQMLIGPDGRLIRSLGQASEADPSGVASIACSSCVRAVDGKAVCGQCERALCGQCVRTCWGCGSVACTLCGLVDCSDMYEKVLCTSCAMFET IGPDGR]
[tr] [A0A1B1L9R9] [a0a1b1l9r9_bactu] [MNKQLFLASLKETQKSILSYACGAALYLWLLIWIFPSMVSAKGLNELIAAMPDSVKKIVGMESPIQNVMDFLAGEYYSLLFIIILTIFCVTVATHLIARHVDKGAMAYLLATPVSRVQIAITQATVLILGLLIIVSVTYVAGLVGAEWFLQDNNLNKELFLKINIVGGLIFLVVSAYSFFFSCICNDERKALSYSASLTILFFVLDMVGKLSDKLEWMKNLSLFTLFRPKEIAEGAYNIWPVSIGLIAGALCIFIVAIVVFKKRDLPL NKELFLKINIVGGLIFLVVSAYSFFFSCICNDERKALSYSASLTILFFVLDMVGKLSDKLEWM]

If all you want is that LAST bit after the spaces downcased, then the default delimiter is fine.

declare -l three
while read -r one two three
do echo "[$one] [$two] [$three]"
done < infile
[sp|O15304|SIVA_HUMAN] [MPKRSCPFADVAPLQLKVRVSQRELSRGVCAERYSQEVFEKTKRLLFLGAQAYLDHVWDEGCAVVHLPESPKPGPTGAPRAARGQMLIGPDGRLIRSLGQASEADPSGVASIACSSCVRAVDGKAVCGQCERALCGQCVRTCWGCGSVACTLCGLVDCSDMYEKVLCTSCAMFET] [igpdgr]
[tr|A0A1B1L9R9|A0A1B1L9R9_BACTU] [MNKQLFLASLKETQKSILSYACGAALYLWLLIWIFPSMVSAKGLNELIAAMPDSVKKIVGMESPIQNVMDFLAGEYYSLLFIIILTIFCVTVATHLIARHVDKGAMAYLLATPVSRVQIAITQATVLILGLLIIVSVTYVAGLVGAEWFLQDNNLNKELFLKINIVGGLIFLVVSAYSFFFSCICNDERKALSYSASLTILFFVLDMVGKLSDKLEWMKNLSLFTLFRPKEIAEGAYNIWPVSIGLIAGALCIFIVAIVVFKKRDLPL] [nkelflkinivggliflvvsaysfffscicnderkalsysasltilffvldmvgklsdklewm]
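For anyone on the stock macOS shell: `declare -l` was added in bash 4.0, and as far as I know the /bin/bash that ships with macOS is still 3.2, so the loops above may fail there with an invalid-option error. Here is a minimal sketch of the last variant that avoids `declare -l` by piping the field through `tr`. The function name `lower_third` and the shortened sample line are made up for illustration.

```shell
# Hypothetical helper: lowercase only the third whitespace-separated field.
# Reads lines from standard input; does not rely on declare -l.
lower_third() {
    while read -r one two three; do
        # tr reads from the printf pipe here, not from the loop's input
        three=$(printf '%s' "$three" | tr '[:upper:]' '[:lower:]')
        printf '[%s] [%s] [%s]\n' "$one" "$two" "$three"
    done
}

# Shortened sample line (not the real data from the question)
printf 'sp|O15304|SIVA_HUMAN MPKIGPDGRLIR IGPDGR\n' | lower_third
```

This starts one `tr` process per line, so it will be slow on very large files; awk is likely the better tool there.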

Can you check the question again? I would absolutely like and accept your answer if it helps me find the solution. I cannot ask another question because the website does not allow me — Learner, Dec 20 '18 at 22:24
You should be able to edit the question to make it clear. Did you only want the part after the last pipe but before the space? — Paul Hodges, Dec 21 '18 at 14:14

Ian McGowan · Answer 3 · 2018-12-20T20:37:14.533

So the question is how to correctly use the 3rd field as a pattern to do a sub in the rest of the string, and also how to send the output of the join to the awk command. Note that gsub should be given a target (its third argument). Without one it works on the whole line, so if field 3 is e.g. a single character, it will also match and be replaced in $1.

join df1.txt df2.txt | awk '{gsub($3, tolower($3), $2) ; print $1 "\t" $2}'

To show an example, with and without the target:

ian@orca:~/tmp$ cat t
sp|O15304|SIVA_HUMAN FALALALALA A

ian@orca:~/tmp$ awk '{gsub($3, tolower($3)) ; print $1 "\t" $2}' t
sp|O15304|SIVa_HUMaN    FaLaLaLaLa

ian@orca:~/tmp$ awk '{gsub($3, tolower($3), $2) ; print $1 "\t" $2}' t
sp|O15304|SIVA_HUMAN    FaLaLaLaLa
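One more thing to keep in mind: gsub treats its first argument as a regular expression, not a literal string. For plain letter sequences like these that makes no difference, but a sketch that does a purely literal match with index() and substr() could look like this. The function name `lower_fragment` and the shortened sample line are hypothetical, and unlike gsub it only lowercases the first occurrence of field 3 inside field 2.

```shell
# Hypothetical helper: lowercase the first literal occurrence of $3 inside $2.
# Reads from the files given as arguments, or from standard input if none.
lower_fragment() {
    awk '{
        pos = index($2, $3)     # 1-based position of the match, 0 if not found
        if (pos > 0) {
            # text before the match, the lowercased match, text after the match
            $2 = substr($2, 1, pos - 1) tolower($3) substr($2, pos + length($3))
        }
        print $1 "\t" $2
    }' "$@"
}

# Shortened sample line (not the real data from the question)
printf 'FIRST MPKIGPDGRLIR IGPDGR\n' | lower_fragment
```

Because only $2 is rebuilt, $1 is never touched, which is the same reason for passing $2 as the target to gsub.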

edited Dec 20 '18 at 20:37

answered Dec 20 '18 at 20:28

I liked and accepted your answer. It does not really help me on huge data, but it works on the sample I provided, so thanks for your time. I am now trying to figure out what the problem is – Learner Dec 20 '18 at 20:45
Thanks for the internet points! What happens with huge data? Perhaps another question, but this time try to get to the crux of the problem before asking :-) – Ian McGowan Dec 20 '18 at 20:52
@Ian McGowan people already disliked my question; if this happens again, I won't be able to ask questions. I can share data here with you and you can see what is going on. I would absolutely like an answer of yours if the problem is solved – Learner Dec 20 '18 at 20:55
I pasted bigger data in my question, but the code still works! I really don't know why it does not work on huge data – Learner Dec 20 '18 at 21:01
Don't be discouraged! Take this as feedback: if you a) try to distill the question down to the core, and b) provide a simple example with expected output, you'll get better results. – Ian McGowan Dec 20 '18 at 21:04
It does not allow me to ask any more questions. I pasted the question again with an exact example. Can you look at it? – Learner Dec 20 '18 at 21:21

score 0 · Answer 4 · answered Dec 20 '18 at 19:53

 sed -rn 's/(.*\s.*\s)(.*)$/\1 \L\2 /p' tmp.txt

Explanation:

I do not know awk well and it likely is possible to do this with awk as well. sed takes each line on its own and:

's/    substitutes
(      a group
  .*     containing any characters of any amount
  \s     a whitespace
  .*     again some characters
  \s     again a whitespace
)      and stores that group as \1
(.*)   and puts all the remaining characters in group \2
$      until the end of the line
/      Substitute all of this with:
\1     The first group
       a space (remove it if you do not want it)
\L\2   The second group in lowercase
/p     and print that

The -r flag enables extended regular expressions, so the parentheses form capture groups without needing backslashes. The -n flag tells sed not to print every line automatically.

Tested on cygwin. Perhaps you need the -E flag instead of -r on your OS. Perhaps you need to use the POSIX compliant [[:space:]] instead of \s for whitespace.
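As far as I know `\L` is a GNU sed extension, so the command above will probably not work with the BSD sed that ships with macOS. Here is a rough awk equivalent that lowercases only the last whitespace-separated field, assuming that is what is wanted. The function name `lower_last_field` and the shortened sample line are made up.

```shell
# Hypothetical helper: lowercase the last whitespace-separated field of each line.
# Reads from the files given as arguments, or from standard input if none.
lower_last_field() {
    # NF guards against empty lines; the second block prints every line
    awk 'NF { $NF = tolower($NF) } { print }' "$@"
}

# Shortened sample line (not the real data from the question)
printf 'FIRST MPKIGPDGRLIR IGPDGR\n' | lower_last_field
```

Assigning to $NF makes awk rebuild the line with single spaces between fields, so any original tabs or repeated spaces are not preserved.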

Can you check the question again? I would absolutely like and accept your answer if it helps me find the solution. I cannot ask another question because the website does not allow me — Learner, Dec 20 '18 at 22:24
@Learner Sure! It looks still somewhat unclear to me. Could you edit your question so that it contains an example output for your example input? What exactly must become lowercase? `SIVA_HUMAN`? — lucidbrot, Dec 21 '18 at 06:15

score -2 · Answer 5 · answered Dec 20 '18 at 19:32

Try something like this:

cat text.txt | cut -d"|" -f3
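To also lowercase what `cut` returns, the output can be piped through `tr`. This is a small sketch; the function name `third_pipe_field_lower` and the shortened sample line are invented for illustration. Note that everything after the second `|` counts as the third field here, so the sequences get lowercased too.

```shell
# Hypothetical helper: take the third |-separated field and lowercase it.
# Reads from the files given as arguments, or from standard input if none.
third_pipe_field_lower() {
    cut -d'|' -f3 "$@" | tr '[:upper:]' '[:lower:]'
}

# Shortened sample line (not the real data from the question)
printf 'sp|O15304|SIVA_HUMAN MPKIGPDGRLIR IGPDGR\n' | third_pipe_field_lower
```

The first two parts of the line are dropped by `cut`, so this only helps if the ID prefix is not needed in the output.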

JoseLinares

I also want to lowercase that third part at the same time. Does it do the job? – Learner Dec 20 '18 at 19:34
I dont think so. – Derviş Kayımbaşıoğlu Dec 20 '18 at 19:35
you need to change text.txt with the name of your file – JoseLinares Dec 20 '18 at 19:37
OP wants to lowercase third part of the source file – Derviş Kayımbaşıoğlu Dec 20 '18 at 19:38
@Simonare I gave an output above – Learner Dec 20 '18 at 19:41
And the [`cat` is useless.](/questions/11710552/useless-use-of-cat) – tripleee Dec 20 '18 at 19:55
