I would like NC_008934.1 to be my key and "Glypta fumiferanae ichnovirus segment B18, complete sequence" to be my value. Unfortunately, the code below is not working.

UPDATE:

from subprocess import Popen, PIPE

p = Popen(
        ["find .  -name \"*.fna\" -exec  grep \">\" '{}' \; | cut -d '|' -f 4,5"],
        stdout=PIPE,
        stderr=PIPE)
result, err = p.communicate()
if p.returncode != 0:
    raise IOError(err)
names = result.strip()
# names contains many strings like this: NC_008934.1| Glypta fumiferanae ichnovirus segment B18, complete sequence

names_dict = {n[0] : n[1] for n in (nameline.split("|") for nameline in namelines)}
print "!!!", names_dict

Error

python mapped_ids_names.py 
Traceback (most recent call last):
  File "mapped_ids_names.py", line 6, in <module>
    stderr=PIPE)
  File "/work/water/miniconda2/lib/python2.7/subprocess.py", line 711, in __init__
    errread, errwrite)
  File "/work/water/miniconda2/lib/python2.7/subprocess.py", line 1343, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory

What did I miss?

user977828
  • In what way is it not working? Your last line does not seem syntactically correct. – bli Jun 28 '17 at 08:04
  • File "mapped_ids_names.py", line 7 names_dict = {(n[0]:n[1]) n.split("|") for n in names} ^ SyntaxError: invalid syntax – user977828 Jun 28 '17 at 09:26
  • You can't have just a space between your tuple and `n.split`. See my answer for a more syntactically correct `dict` generation code. – bli Jun 28 '17 at 09:38

2 Answers

0

The following works for me:

nameline1 = "NC_008934.1| Glypta fumiferanae ichnovirus segment B18, complete sequence"
nameline2 = "NC_008934.2| Glypta fumiferanae ichnovirus segment B18, complete sequence 2"
namelines = [nameline1, nameline2]
names_dict = {n[0] : n[1] for n in (nameline.split("|") for nameline in namelines)}

Edit

Based on your comments, it seems that the output of your os.system call is not what you think it is: you get the return code of your shell command, which is an int, not its standard output. You may have more success using the subprocess module:

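from subprocess import Popen, PIPE

# The command is a shell pipeline, so pass it as one string with shell=True.
# The raw string keeps the backslash in "\;" intact for find.
cmd = r"""find . -name "*.fna" -exec grep ">" '{}' \; | cut -d '|' -f 4,5"""
p = Popen(cmd, stdout=PIPE, stderr=PIPE, shell=True)
result, err = p.communicate()
if p.returncode != 0:
    raise IOError(err)
# One "accession| description" string per header line
namelines = result.decode("utf-8").strip().split("\n")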
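To make the difference between the two calls concrete, here is a minimal sketch. The `echo` command is a stand-in chosen only for illustration (it is not the command from the question), and the sketch assumes a POSIX shell.

```python
import os
import subprocess

# Stand-in command for illustration: prints one header-like line.
cmd = "echo 'NC_008934.1| Glypta fumiferanae ichnovirus segment B18, complete sequence'"

# os.system returns the exit status (an int); the command's output
# goes straight to the terminal and is not captured.
status = os.system(cmd)

# check_output returns what the command wrote to stdout
# (bytes in Python 3, str in Python 2).
output = subprocess.check_output(cmd, shell=True)
text = output.decode("utf-8").strip()
```

Here `status` is a number you cannot split or iterate over, which matches the `'int' object is not iterable` error, while `text` holds the actual line.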

To help you adapt this to your needs, see the following links:

Pay special attention to the warning about using shell=True in the last link: Don't use this if the command line is not completely under your control.
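If you would rather avoid shell=True altogether, one option is to do the find/grep/cut steps in Python itself. The following is only a sketch: the function name `collect_names` is made up for this example, and it assumes header lines of the form `>gi|<number>|ref|<accession>|<description>`, which is what the `cut -d '|' -f 4,5` in the question suggests.

```python
import fnmatch
import os


def collect_names(root="."):
    """Map accession -> description from '>' header lines of *.fna files under root."""
    names = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in fnmatch.filter(filenames, "*.fna"):
            with open(os.path.join(dirpath, filename)) as handle:
                for line in handle:
                    if not line.startswith(">"):
                        continue  # sequence line, not a header
                    fields = line[1:].rstrip("\n").split("|")
                    # Assumed layout: gi|<number>|ref|<accession>|<description>
                    if len(fields) >= 5:
                        names[fields[3].strip()] = fields[4].strip()
    return names
```

Calling `collect_names(".")` would then give the dictionary directly, with no subprocess involved.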

Besides, you seem to be parsing fasta files. This can be done using various python libraries. You can see some of them tested here: https://bioinformatics.stackexchange.com/a/380/292

bli
  • I got Traceback (most recent call last): File "mapped_ids_names.py", line 7, in names_dict = {n[0] : n[1] for n in (nameline.split("|") for nameline in names)} TypeError: 'int' object is not iterable – user977828 Jun 28 '17 at 09:22
  • Traceback (most recent call last): File "mapped_ids_names.py", line 6, in stderr=PIPE) File "/work/water/miniconda2/lib/python2.7/subprocess.py", line 711, in __init__ errread, errwrite) File "/work/water/miniconda2/lib/python2.7/subprocess.py", line 1343, in _execute_child raise child_exception OSError: [Errno 2] No such file or directory – user977828 Jun 28 '17 at 10:35
  • @user977828 Maybe you could update your question with the latest version of your code. This would make the current problem easier to understand. – bli Jun 28 '17 at 10:46
  • My use of Popen is wrong, sorry. I'll try to correct that. Meanwhile, you can have a look at https://docs.python.org/2.7/library/subprocess.html#replacing-shell-pipeline. You can look at `err` to get debugging clues. – bli Jun 28 '17 at 13:29
  • The issue was that when a single string is passed and is composed of more than an executable name, `Popen` needs the `shell=True` option. I updated my answer accordingly. – bli Jun 28 '17 at 13:58
0

When you iterate over a string with for n in names, you get each character of names. It seems you want to iterate over the lines of names instead, which you can do with splitlines().

Besides this, your last line isn't syntactically correct and you probably want to strip() the keys and values to remove any trailing or leading whitespace.

Try this:

names_dict = {pair[0].strip() : pair[1].strip() for pair in (line.split("|") for line in names.splitlines())}

It works for me:

In [1]: names = """NC_008934.1| Glypta fumiferanae ichnovirus segment B18, complete sequence
   ...: NC_008934.2| Glypta fumiferanae ichnovirus segment B18, complete sequence 2
   ...: NC_008934.3| Glypta fumiferanae ichnovirus segment B18, complete sequence 3
   ...: NC_008934.4| Glypta fumiferanae ichnovirus segment B18, complete sequence 4"""

In [2]: names_dict = {pair[0].strip() : pair[1].strip() for pair in (line.split("|") for line in names.splitlines())}

In [3]: names_dict
Out[3]: 
{'NC_008934.1': 'Glypta fumiferanae ichnovirus segment B18, complete sequence',
 'NC_008934.2': 'Glypta fumiferanae ichnovirus segment B18, complete sequence 2',
 'NC_008934.3': 'Glypta fumiferanae ichnovirus segment B18, complete sequence 3',
 'NC_008934.4': 'Glypta fumiferanae ichnovirus segment B18, complete sequence 4'}
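A slightly more defensive variant, in case the real output contains blank lines or descriptions that themselves include a "|", could look like the sketch below. The helper name `parse_names` is hypothetical; it splits each line only at the first "|".

```python
def parse_names(text):
    """Build {accession: description} from 'accession| description' lines."""
    names = {}
    for line in text.splitlines():
        # partition splits at the first "|" only, so any later "|"
        # stays inside the description.
        key, sep, value = line.partition("|")
        if not sep:
            continue  # skip blank lines or lines without a "|"
        names[key.strip()] = value.strip()
    return names
```

With the `names` string from above, `parse_names(names)` should give the same dictionary as the comprehension.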
Raniz
  • Traceback (most recent call last): File "mapped_ids_names.py", line 7, in names_dict = {n[0].strip() : n[1].strip() for n in (nameline.split("|") for nameline in names.splitlines)} AttributeError: 'int' object has no attribute 'splitlines' – user977828 Jun 28 '17 at 09:25
  • It seems you've updated your question with a proper way to catch the output of your shell command, is it still not working? – Raniz Jun 28 '17 at 12:23