1

I have a tab-delimited text file A (representing BLAST output):

Name1   BBBBBBBBBBBB    99.40   166 1   0   1   166 334 499 3e-82    302
Name2   DDDDDDDDDDDD    98.80   167 2   0   1   167 346 512 4e-81    298

and a text file B (representing a phylogenetic dendrogram) that looks like this:

"Cluster A": {
        "member": {
            "Cluster A": "BBBBBBBBBBBB This is Animal A", 
                   }, 
        "name": "Cluster A"
             }, 
    "Cluster B: {
        "member": {
            "Cluster B": "DDDDDDDDDDDD This is Animal B"
                   }, 
        "name": "cluster B"
                 }

I want to take the string found in the second column of text file A (DDDDDDDDDDDD, for example) and look it up in text file B. The script should then append the information found in text file B as a new tab-separated column of text file A:

Name1   BBBBBBBBBBBB    99.40   166 1   0   1   166 334 499 3e-82    302 Cluster A This is Animal A
Name2   DDDDDDDDDDDD    98.80   167 2   0   1   167 346 512 4e-81    298 Cluster B This is Animal B

Thank you very much!

nouse
  • Not sure if those are just copy-and-paste errors, but is the second one supposed to be JSON and just full of syntax errors (a missing closing }, a single " in the middle)? In that case you could try parsing it into a dict, which makes accessing and comparing a lot easier. – gentoomaniac Mar 04 '15 at 13:38
  • It is indeed JSON, but I have never worked with this format before. – nouse Mar 04 '15 at 13:42
  • In that case look here for some basics: http://stackoverflow.com/questions/20199126/reading-a-json-file-using-python You'll end up with a dict which gives you easy access to the member fields. Simply cutting the desired string out of the BLAST output and searching for it in the "Cluster.*" fields should do the trick. – gentoomaniac Mar 04 '15 at 13:56
  • The JSON format is not adding any value. A simple tab-delimited file would be easier to work with. – glenn jackman Mar 04 '15 at 14:13
  • @glennjackman the point is that the second input is already in json. – gentoomaniac Mar 04 '15 at 14:16

3 Answers

0

Here is some sample code that reads the data from the two files. Your example is missing the outer {}, which would make parsing fail, so the code adds them.

It then loops over the cluster members and constructs your desired result:

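import json

# Read the BLAST table (file A).
with open("in1") as blast:
    blast_data = blast.readlines()

# The sample lacks the outer {}, so add them before parsing.
with open("in2") as jsonfile:
    json_data = json.loads("{%s}" % jsonfile.read())

for bdata in blast_data:
    fields = bdata.split()
    if len(fields) < 2:
        continue  # skip blank or malformed lines
    seq_id = fields[1]  # subject ID from the second column
    for cluster in json_data:
        for member, text in json_data[cluster]['member'].items():
            if seq_id in text:
                # Append the member key and the description without the ID.
                description = text.replace(seq_id, '', 1).strip()
                print("\t".join([bdata.strip(), member, description]))
                break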
gentoomaniac
  • File "blast.py", line 8, in json_data = json.loads("{%s}" % jsonfile.read()) File "/opt/qiime-1.8.0/python-2.7.3-release/lib/python2.7/json/__init__.py", line 326, in loads return _default_decoder.decode(s) File "/opt/qiime-1.8.0/python-2.7.3-release/lib/python2.7/json/decoder.py", line 366, in decode obj, end = self.raw_decode(s, idx=_w(s, 0).end()) File "/opt/qiime-1.8.0/python-2.7.3-release/lib/python2.7/json/decoder.py", line 382, in raw_decode obj, end = self.scan_once(s, idx) ValueError: Expecting property name: line 1 column 1 (char 1) – nouse Mar 04 '15 at 14:56
  • Is something wrong with our Python? It's embedded in another software suite. – nouse Mar 04 '15 at 14:57
  • I'd rather think the format of the input file is broken (maybe in combination with the code). Can you somehow validate that it parses correctly, or show me a complete sample output? – gentoomaniac Mar 04 '15 at 15:00
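
One way to check whether the input parses, as asked in the comment above, is to attempt the parse and report where it fails. This is only a minimal sketch and not part of the answer's code: the helper name `load_clusters` is hypothetical, and the rule of adding the outer braces only when the text does not already start with `{` or `[` is an assumption about what the real file might look like.

```python
import json


def load_clusters(text):
    """Parse cluster JSON, adding the outer braces only if they are missing.

    Hypothetical helper: re-raises ValueError with the parser's own message
    so the offending position in file B can be inspected.
    """
    stripped = text.strip()
    # Wrap bare '"Cluster A": {...}, ...' fragments; leave full documents alone.
    if stripped[:1] not in ("{", "["):
        stripped = "{%s}" % stripped
    try:
        return json.loads(stripped)
    except ValueError as err:  # json.JSONDecodeError is a ValueError subclass
        raise ValueError("file B is not valid JSON: %s" % err)
```

Calling it as `load_clusters(open("in2").read())` would either return the parsed data or raise an error whose message includes the line and column reported by the parser. The sample in the question would still fail here, since it contains an unterminated key string and a trailing comma.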
0

Fixing up the JSON file:

$ cat B
[
    { "Cluster A": { "member": { "Cluster A": "BBBBBBBBBBBB This is Animal A" }, "name": "Cluster A" } }, 
    { "Cluster B": { "member": { "Cluster B": "DDDDDDDDDDDD This is Animal B" }, "name": "cluster B" } }
]

Then, a Perl solution:

perl -MJSON -MPath::Class -E '
    my $data = decode_json file("B")->slurp;
    $, = "\t";
    for my $line (file("A")->slurp(chomp => 1)) {
        my @F = split /\t/, $line;
        for my $item (@$data) {
            for my $cluster (keys %$item) {
                while (my ($key, $value) = each %{$item->{$cluster}{member}} ) {
                    if ($value =~ /\Q$F[1]\E\s+(.*)/) {   # \Q..\E: match the ID literally
                        say $line, $cluster, $1;
                    }
                }
            }
        }
    }
'

outputs

Name1   BBBBBBBBBBBB    99.40   166 1   0   1   166 334 499 3e-82   302 Cluster A   This is Animal A
Name2   DDDDDDDDDDDD    98.80   167 2   0   1   167 346 512 4e-81   298 Cluster B   This is Animal B

For kicks, the equivalent Ruby:

ruby -rjson -e '
  data = JSON.load File.new("B")
  File.readlines("A").each {|line|
    line.chomp!
    f = line.split("\t")
    data.each {|obj|
      obj.each_key {|cluster|
        obj[cluster]["member"].each_pair {|key, value| 
          if m = value.match(/#{Regexp.escape(f[1])}\s+(.*)/)
            puts [line, cluster, m[1]].join("\t")
          end
        }
      }
    }
  }
'
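
For comparison, the same join can be sketched in Python against the repaired, list-shaped B file. This is not part of the original answer. Unlike the nested loops above, it first builds a dictionary keyed by the leading ID of each member string, so each BLAST row needs only one lookup. The names `build_index` and `annotate` are illustrative, the inline strings stand in for files A and B, and the sketch assumes every member value has the form "ID description".

```python
import json


def build_index(clusters):
    """Map each sequence ID to a (cluster name, description) tuple."""
    index = {}
    for obj in clusters:  # one single-key object per cluster
        for cluster, body in obj.items():
            for value in body["member"].values():
                # Split "ID description" at the first space.
                seq_id, _, description = value.partition(" ")
                index[seq_id] = (cluster, description.strip())
    return index


def annotate(blast_lines, index):
    """Yield each tab-delimited BLAST row with cluster and description appended."""
    for line in blast_lines:
        line = line.rstrip("\n")
        fields = line.split("\t")
        if len(fields) > 1 and fields[1] in index:
            yield "\t".join((line,) + index[fields[1]])


# Inline sample data standing in for files B and A.
B = json.loads("""[
  {"Cluster A": {"member": {"Cluster A": "BBBBBBBBBBBB This is Animal A"}, "name": "Cluster A"}},
  {"Cluster B": {"member": {"Cluster B": "DDDDDDDDDDDD This is Animal B"}, "name": "cluster B"}}
]""")
A = [
    "Name1\tBBBBBBBBBBBB\t99.40\t166\t1\t0\t1\t166\t334\t499\t3e-82\t302\n",
    "Name2\tDDDDDDDDDDDD\t98.80\t167\t2\t0\t1\t167\t346\t512\t4e-81\t298\n",
]

for row in annotate(A, build_index(B)):
    print(row)
```

Rows whose second column has no entry in the index are silently skipped here; whether to skip or pass them through unchanged is a design choice the question does not settle.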
glenn jackman
0

A shell script snippet:

#!/usr/bin/ksh
# Collect the patterns (second column) of file1.
awk '{print $2}' file1 > tmpfile
for i in `cat tmpfile`
do
    # Find the line of file2 that contains the pattern as a whole word.
    aa=`grep -w "$i" file2`
    # Append that line to the matching row of file1.
    awk -v out="$aa" -v pattern="$i" '$2 ~ pattern { print $0 "   " out }' file1
done

awk '{print $2}' file1 > tmpfile -- takes the pattern from each line of the first file and stores it in tmpfile.
aa=`grep -w "$i" file2` -- matches the same pattern in file2 and stores the entire matching line in the variable aa.
awk -v out="$aa" -v pattern="$i" '$2 ~ pattern { print $0 "   " out }' file1 -- appends the string from file2 to its corresponding matching line of file1.
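
The grep-based approach above never parses file2 as JSON, so it tolerates the malformed sample from the question. A rough Python sketch of the same line-by-line idea follows. The regular expression and the name `scan_members` are assumptions about the line layout ("Cluster X": "ID description"), not something the answer specifies.

```python
import re

# Matches lines such as:  "Cluster B": "DDDDDDDDDDDD This is Animal B"
MEMBER_LINE = re.compile(r'"([^"]+)":\s*"([^"\s]+)\s+([^"]*)"')


def scan_members(lines):
    """Return {ID: (cluster, description)} from raw lines, without JSON parsing."""
    found = {}
    for line in lines:
        match = MEMBER_LINE.search(line)
        # "name": "Cluster A" has the same shape, so skip that key explicitly.
        if match and match.group(1) != "name":
            cluster, seq_id, description = match.groups()
            found[seq_id] = (cluster, description.strip())
    return found


# The question's sample, syntax errors included.
sample = '''"Cluster A": {
        "member": {
            "Cluster A": "BBBBBBBBBBBB This is Animal A",
                   },
        "name": "Cluster A"
             },
    "Cluster B: {
        "member": {
            "Cluster B": "DDDDDDDDDDDD This is Animal B"
                   },
        "name": "cluster B"
                 }'''.splitlines()

members = scan_members(sample)
```

The resulting dictionary can then be used to look up the second column of each BLAST row. The trade-off is that this depends on each member sitting on its own line, which a real JSON parser would not require.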

Sadhun