I have a list containing directories and files. I want to keep only the files, so I wrote the following code to filter it. However, some entries that should have been removed are still in the list afterwards, such as 'wangshx/' and ''.

From this result, I suspect something is wrong with the if statement. Can anyone point out where the problem is?

In [7]: filelist = ['/tmp/test2.pbs', '/tmp/test.pbs', '/public/home/wangshx:', ' ',
   ...:             'correct_order.txt', 'download/', 'filepaths.RData', 'lib/', 'Log.out',
   ...:             'ncbi_error_report.xml', 'new_hg19.1.bt2', 'new_hg19.2.bt2',
   ...:             'new_hg19.3.bt2', 'new_hg19.4.bt2', 'new_hg19.fa', 'new_hg19.rev.1.bt2',
   ...:             'new_hg19.rev.2.bt2', 'perl5/', 'practice/', 'repnames_nfragments.txt',
   ...:             'soft/', 'songmf/', 'sort.pbs', 'test.pbs', 'test.pbs.o1167575',
   ...:             'test.pbs.o1167590', 'tmp/', 'wangshx/', 'workspace/', 'wt/', 'wx/', '']

In [8]: len(filelist)
Out[8]: 32

In [9]: for f in filelist:
   ...:     print(f)
   ...: 
/tmp/test2.pbs
/tmp/test.pbs
/public/home/wangshx:

correct_order.txt
download/
filepaths.RData
lib/
Log.out
ncbi_error_report.xml
new_hg19.1.bt2
new_hg19.2.bt2
new_hg19.3.bt2
new_hg19.4.bt2
new_hg19.fa
new_hg19.rev.1.bt2
new_hg19.rev.2.bt2
perl5/
practice/
repnames_nfragments.txt
soft/
songmf/
sort.pbs
test.pbs
test.pbs.o1167575
test.pbs.o1167590
tmp/
wangshx/
workspace/
wt/
wx/


In [10]: for f in filelist:
    ...:     print(f)
    ...:     if f[-1]=='/' or f[-1]==':' or f=='' or f==' ':
    ...:         print("=> Should remove " + f)
    ...:         filelist.remove(f)
    ...: 
/tmp/test2.pbs
/tmp/test.pbs
/public/home/wangshx:
=> Should remove /public/home/wangshx:
correct_order.txt
download/
=> Should remove download/
lib/
=> Should remove lib/
ncbi_error_report.xml
new_hg19.1.bt2
new_hg19.2.bt2
new_hg19.3.bt2
new_hg19.4.bt2
new_hg19.fa
new_hg19.rev.1.bt2
new_hg19.rev.2.bt2
perl5/
=> Should remove perl5/
repnames_nfragments.txt
soft/
=> Should remove soft/
sort.pbs
test.pbs
test.pbs.o1167575
test.pbs.o1167590
tmp/
=> Should remove tmp/
workspace/
=> Should remove workspace/
wx/
=> Should remove wx/

In [11]: filelist
Out[11]: 
['/tmp/test2.pbs',
 '/tmp/test.pbs',
 ' ',
 'correct_order.txt',
 'filepaths.RData',
 'Log.out',
 'ncbi_error_report.xml',
 'new_hg19.1.bt2',
 'new_hg19.2.bt2',
 'new_hg19.3.bt2',
 'new_hg19.4.bt2',
 'new_hg19.fa',
 'new_hg19.rev.1.bt2',
 'new_hg19.rev.2.bt2',
 'practice/',
 'repnames_nfragments.txt',
 'songmf/',
 'sort.pbs',
 'test.pbs',
 'test.pbs.o1167575',
 'test.pbs.o1167590',
 'wangshx/',
 'wt/',
 '']

Best,

Shixiang

Shixiang Wang
  • Using `pathlib` will simplify this. – Trenton McKinney Oct 29 '19 at 16:48
  • Perhaps take a look at the `os.path` or `pathlib` modules. Both have functions for determining whether a path is a directory or a file. [os.path docs](https://docs.python.org/3/library/os.path.html) or [pathlib docs](https://docs.python.org/3/library/pathlib.html) – Benjamin Hoving Oct 29 '19 at 16:49
  • You are removing elements from a list as you iterate over it, which makes the iteration skip the element that follows each removal. Try this to illustrate: `a = [1, 2, 3, 4]` `for i in a: a.remove(i)` `print(a)` – Vilius Klakauskas Oct 29 '19 at 16:50
  • This might answer your question: https://stackoverflow.com/a/1207427/5501462 – m0etaz Oct 29 '19 at 16:58
  • @ViliusKlakauskas Thanks, I used `list.copy()` to create a copy of the list, and it works. – Shixiang Wang Oct 30 '19 at 02:23
  • @BenjaminHoving Thank you. However, the list does not come from the local machine; the paths are from a remote Linux host. – Shixiang Wang Oct 30 '19 at 02:27
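
The skipping described in the comments can be reproduced in isolation. The following is a minimal sketch using three entries taken from the question's list; the helper names `drop_dirs_in_place` and `drop_dirs_over_copy` are made up for illustration and are not from the original post.

```python
def drop_dirs_in_place(entries):
    """Buggy pattern: remove from the same list that is being iterated."""
    for entry in entries:
        if entry.endswith('/'):
            # After this removal the following entry slides into the current
            # index, so the loop never looks at it.
            entries.remove(entry)
    return entries


def drop_dirs_over_copy(entries):
    """Iterate over a shallow copy so removals do not disturb the loop."""
    for entry in entries.copy():
        if entry.endswith('/'):
            entries.remove(entry)
    return entries


sample = ['perl5/', 'practice/', 'repnames_nfragments.txt']

print(drop_dirs_in_place(list(sample)))   # ['practice/', 'repnames_nfragments.txt']
print(drop_dirs_over_copy(list(sample)))  # ['repnames_nfragments.txt']
```

In the first call 'practice/' survives because it directly follows a removed entry, which matches the leftovers seen in `Out[11]`.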

2 Answers

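Based on your list, it looks like you could get away with something as simple as checking for the character `.` in each string. Note that this is only a heuristic for this particular list: it would misclassify a file without an extension, or a directory whose name contains a dot.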

Something like this:

filelist = ['/tmp/test2.pbs', '/tmp/test.pbs', '/public/home/wangshx:', ' ', 'correct_order.txt', 'download/', 'filepaths.RData', 'lib/', 'Log.out', 'ncbi_error_report.xml', 'new_hg19.1.bt2', 'new_hg19.2.bt2', 'new_hg19.3.bt2', 'new_hg19.4.bt2' , 'new_hg19.fa', 'new_hg19.rev.1.bt2', 'new_hg19.rev.2.bt2', 'perl5/', 'practice/', 'repnames_nfragments.txt', 'soft/', 'songmf/', 'sort.pbs', 'test.pbs', 'test.pbs.o1167575', 'test.pbs.o1167590', 'tmp/', 'wangshx/', 'workspace/', 'wt/', 'wx/', '']

for f in filelist:
    if '.' in f:
        print(f)
    else:
        print("=> Should remove " + f)

which will output:

/tmp/test2.pbs
/tmp/test.pbs
=> Should remove /public/home/wangshx:
=> Should remove  
correct_order.txt
=> Should remove download/
filepaths.RData
=> Should remove lib/
Log.out
ncbi_error_report.xml
new_hg19.1.bt2
new_hg19.2.bt2
new_hg19.3.bt2
new_hg19.4.bt2
new_hg19.fa
new_hg19.rev.1.bt2
new_hg19.rev.2.bt2
=> Should remove perl5/
=> Should remove practice/
repnames_nfragments.txt
=> Should remove soft/
=> Should remove songmf/
sort.pbs
test.pbs
test.pbs.o1167575
test.pbs.o1167590
=> Should remove tmp/
=> Should remove wangshx/
=> Should remove workspace/
=> Should remove wt/
=> Should remove wx/
=> Should remove 
JavierCastro

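The problem is that you are modifying the list while iterating over it: each `remove` shifts the remaining items one position to the left, so the loop skips the item that follows the one just removed. Use a list comprehension instead. Your exact filter requirements aren't clear, but as an example, the following builds a new list with anything ending in a slash removed: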

filelist = ['/tmp/test2.pbs', '/tmp/test.pbs', '/public/home/wangshx:', ' ', 'correct_order.txt', 'download/',
            'filepaths.RData', 'lib/', 'Log.out', 'ncbi_error_report.xml', 'new_hg19.1.bt2', 'new_hg19.2.bt2',
            'new_hg19.3.bt2', 'new_hg19.4.bt2', 'new_hg19.fa', 'new_hg19.rev.1.bt2', 'new_hg19.rev.2.bt2',
            'perl5/', 'practice/', 'repnames_nfragments.txt', 'soft/', 'songmf/', 'sort.pbs', 'test.pbs',
            'test.pbs.o1167575', 'test.pbs.o1167590', 'tmp/', 'wangshx/', 'workspace/', 'wt/', 'wx/', '']

files = [file for file in filelist if not file.endswith('/')]

print(files)

Output:

['/tmp/test2.pbs', '/tmp/test.pbs', '/public/home/wangshx:', ' ', 'correct_order.txt', 'filepaths.RData', 'Log.out', 'ncbi_error_report.xml', 'new_hg19.1.bt2', 'new_hg19.2.bt2', 'new_hg19.3.bt2', 'new_hg19.4.bt2', 'new_hg19.fa', 'new_hg19.rev.1.bt2', 'new_hg19.rev.2.bt2', 'repnames_nfragments.txt', 'sort.pbs', 'test.pbs', 'test.pbs.o1167575', 'test.pbs.o1167590', '']
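
The same approach could be extended to the full condition from the question (trailing `/`, trailing `:`, empty or blank strings). This is only a sketch under that assumption about the requirements; `keep_files` is a hypothetical helper name, and the sample is a shortened subset of the question's list. Using `str.endswith` with a tuple also avoids indexing `f[-1]` on an empty string, which would raise `IndexError`.

```python
def keep_files(entries):
    """Return a new list without directory markers or blank entries.

    An entry is dropped if it is empty or whitespace-only, or if it ends
    with '/' (directory) or ':' (directory header line).
    """
    return [
        entry for entry in entries
        if entry.strip() and not entry.endswith(('/', ':'))
    ]


# Shortened subset of the list from the question.
sample = ['/tmp/test.pbs', '/public/home/wangshx:', ' ', 'download/',
          'Log.out', 'wangshx/', '']

print(keep_files(sample))  # ['/tmp/test.pbs', 'Log.out']
```

Because the comprehension builds a new list, the original `sample` is left unchanged and no entries are skipped.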
Mark Tolonen