Python, string slicing (getting file names from a list of file locations)

Question

I am trying to get the file names from a list of file locations. I think it involves string slicing.

The one I worked out is:

L = ['C:\\Design\dw\file4.doc',
'C:\\light\PDF\downloads\list.doc',
'C:\\Design\Dq\file4g.doc',
'C:\\Design\Dq\file4r.doc',
'C:\\Design\Dq\file4k.doc',
'C:\\Design\Dq\ole.doc',
'C:\\GE\easy\file\os_references(9).doc',
'C:\\mate\KLO\Market\BIZ\KP\who\Documents\REF.doc']

LL = []

for a in L:
    b = a.split('\')
    for c in b:
        if c.endswith('.doc'):
            c.replace('.doc', '')
            LL.append(c)

print LL

Question 1: The output still contains ‘.doc’. Why, and how can I remove it?

Question 2: What’s a better way to get the file names?

Thanks.

You should strongly consider adding `r` in front of your path strings (`r'…'`) in order to make them raw strings: this is probably what you intended. You can check my answer in order to get details (and a working, simple solution!). — Eric O. Lebigot, Oct 28 '14 at 08:39
Your file names contain "\" path separators but your code contains "/": is this what you intended? Furthermore, are you processing the paths on the same machine as where they were produced? This matters if the path separator convention is not the same between both. — Eric O. Lebigot, Oct 28 '14 at 08:41

score 2 · Accepted Answer · edited Oct 29 '14 at 11:47

The answer to the first question is that strings are immutable: .replace() doesn't modify the string in place but returns a new string, viz:

blaize@bolt ~ $ python 
>>> s = "foobar"
>>> s2 = s.replace("o", "x")
>>> print s
foobar
>>> print s2
fxxbar
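
Applied to the loop in the question, the fix is to keep the value that `replace()` returns. What follows is only a minimal sketch: it is written for Python 3 (hence `print()` with parentheses), it assumes the paths are entered as raw strings, and the names `paths` and `names` are illustrative rather than taken from the question.

```python
paths = [r'C:\Design\dw\file4.doc',
         r'C:\light\PDF\downloads\list.doc']

names = []
for path in paths:
    # '\\' in a normal string literal is one backslash character.
    for part in path.split('\\'):
        if part.endswith('.doc'):
            # replace() returns a new string; the original is unchanged,
            # so the result has to be stored (here: appended directly).
            names.append(part.replace('.doc', ''))

print(names)  # ['file4', 'list']
```

This only answers the first question; it still removes every occurrence of '.doc', not just the trailing extension.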

My answer to the second question follows:

# I use ntpath because I'm running on Linux.
# This way is more robust if you know you'll be dealing with Windows paths.
# An alternative is to import from os.path; then Linux filenames will work
# on Linux and Windows paths will work on Windows.
from ntpath import basename, splitext

# Use r"" strings as people rightly point out.
# "\n" does not do what you think it might.
# See here: https://docs.python.org/2.0/ref/strings.html.
docs = [r'C:\Design\dw\file4.doc',
        r'C:\light\PDF\downloads\list.doc',
        r'C:\Design\Dq\file4g.doc',
        r'C:\Design\Dq\file4r.doc',
        r'C:\Design\Dq\file4k.doc',
        r'C:\Design\Dq\ole.doc',
        r'C:\Design/Dq/test1.doc',  # test a corner case
        r'\\some_unc_machine\Design/Dq/test2.doc',  # test a corner case
        r'C:\GE\easy\file\os_references(9).doc',
        r'C:\mate\KLO\Market\BIZ\KP\who\Documents\REF.doc']

# Please use meaningful variable names:
basenames = []

for doc_path in docs:

    # Please don't reinvent the wheel.
    # Use the builtin path handling functions.
    # File naming has a lot of exceptions and weird cases 
    # (particularly on Windows).
    file_name = basename(doc_path)
    file_basename, extension = splitext(file_name)
    if extension == ".doc":
        basenames.append(file_basename)

print basenames

Best of luck mate. Python is an excellent language.

That's exactly my answer from 2 hours before yours, so I approve. :) Maybe my answer does not show (it currently has -3 votes!). — Eric O. Lebigot, Oct 28 '14 at 13:45
Mr. EOL, no worries. your contribution is widely recognized. :) — Mark K, Oct 29 '14 at 04:45
Yeah EOL is right.. I answered the last part and then reread the question and realized there was a first part to the question. No idea why your answer was shot down EOL. Stackoverflow is fickle. It's back up to 0 fwiw. — demented hedgehog, Oct 30 '14 at 00:13

kylieCatt · Answer 2 · 2014-10-28T06:37:21.770

0

[file.split('\\')[-1].split('.')[0] for file in L]

You're actually not doing any slicing in your example. You are splitting and replacing. Since we know the file name and extension will always be the last part of a path we can use a negative index to access it after splitting.

Once we split again on the period the file name will always be the 0th element so we can just grab that and add it to a list.

EDIT: I just noticed that this method will have problems with paths that contain \f, since in a non-raw string literal Python treats \f as an escape sequence (a form feed).
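
To make the `\f` problem concrete, here is a small sketch (Python 3 syntax, using a shortened, made-up path) that compares a normal string literal with a raw one. The variable names `normal` and `raw` are mine.

```python
normal = 'C:\\dw\file4.doc'   # '\f' becomes one form-feed character
raw = r'C:\dw\file4.doc'      # raw string: backslashes are kept literally

print(repr(normal))             # the form feed shows up as \x0c
print(normal.split('\\')[-1])   # not 'file4.doc': that separator is gone
print(raw.split('\\')[-1])      # file4.doc
```

With the raw literal, the split-and-index approach above sees all the separators it expects.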

edited Oct 28 '14 at 06:37

answered Oct 28 '14 at 06:27

kylieCatt

Downvoting because Python has a dedicated standard tool just for this. – Eric O. Lebigot Oct 28 '14 at 07:26
@EOL Downvote only if the answer is wrong. If you provide an answer with a dedicated tool, you get the accept and the upvotes automatically, and this moves the other answers down (without downvoting them). – Avinash Raj Oct 28 '14 at 07:32
@AvinashRaj While I understand your point, I do think that it is wrong to not use the standard tools in Python: this obscures the intent of the code in a much unnecessary way, and this reinforces bad habits. – Eric O. Lebigot Oct 28 '14 at 07:37
@IanAuld: Python has file path handling tools in `os.path` (here: `basename` and `splitext` do the job for you, in a more legible and efficient way). – Eric O. Lebigot Oct 29 '14 at 07:20

score 0 · Answer 3 · edited Oct 29 '14 at 07:18

Try this if the file names contain no spaces or other symbols:

import re
[re.findall(r'\w+\.doc$', x) for x in L]

Also take a look at the ntpath module.
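
If a regular expression is preferred anyway, one possible sketch (Python 3 syntax) matches everything after the last separator instead of relying on `\w+`, so names containing parentheses or spaces are kept. The pattern and the names `pattern` and `stems` are my own illustration, not part of the answer above, and the paths are assumed to be raw strings.

```python
import re

paths = [r'C:\Design\dw\file4.doc',
         r'C:\GE\easy\file\os_references(9).doc']

# One or more characters that are not a path separator,
# followed (lookahead) by a literal '.doc' at the end of the string.
pattern = re.compile(r'[^\\/]+(?=\.doc$)')

stems = []
for path in paths:
    match = pattern.search(path)
    if match:
        stems.append(match.group(0))

print(stems)  # ['file4', 'os_references(9)']
```

The path functions in `ntpath` remain the more readable choice; the regex is shown only for comparison.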

edited Oct 29 '14 at 07:18

Eric O. Lebigot

answered Oct 28 '14 at 06:29

Hackaholic

The `\w+` does not match one of the file names (the one with `(9)`). – Eric O. Lebigot Oct 28 '14 at 13:50
As given in the description: no spaces and no symbols. And I also told you to take a look at ntpath. Downvoting is not a solution. – Hackaholic Oct 28 '14 at 18:06

score 0 · Answer 4 · edited Oct 28 '14 at 13:46

First, the replace method returns a new string with the replaced value; it does not change the original string. So you need to do:

c = c.replace('.doc', '')

edited Oct 28 '14 at 13:46

Eric O. Lebigot

answered Oct 28 '14 at 06:32

himanshu shekhar

This is not very efficient or robust: not efficient because removing an extension can be done very fast by *starting from the end of the string*, which `replace` cannot do, and not robust because it breaks with unconventional file names (like `my.doc.doc`)—which matters because there are simple and robust solutions. :) – Eric O. Lebigot Oct 28 '14 at 13:48
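
The robustness point in the comment above can be checked with a short sketch (Python 3 syntax). The file name `my.doc.doc` comes from the comment; the variable `name` is illustrative.

```python
from ntpath import splitext

name = 'my.doc.doc'

print(name.replace('.doc', ''))   # my      (every occurrence is removed)
print(splitext(name)[0])          # my.doc  (only the final extension goes)
```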

score 0 · Answer 5 · edited May 23 '17 at 12:27

First answer: replace returns a copy of the string, so your change is never saved.
Second answer: You need the raw representation of several of the paths, because combinations like '\f' are interpreted as escape sequences (here, a single form-feed character).
So the tricky part is converting the strings to their raw representation. For this I've used the raw() function from this answer.
Once we have this function, we can manipulate the strings properly.
I've used re.split to accept both Unix and DOS path formats:

>>> L = [re.split(r'[\/\\]', raw(path)) for path in L]
>>> L
[['C:', 'Design', 'dw', 'file4.doc'], ['C:', 'light', 'PDF', 'downloads', 'list.doc'], ['C:', 'Design', 'Dq', 'file4g.doc'], ['C:', 'Design', 'Dq', 'file4r.doc'], ['C:', 'Design', 'Dq', 'file4k.doc'], ['C:', 'Design', 'Dq', 'ole.doc'], ['C:', 'GE', 'easy', 'file', 'os_references(9).doc'], ['C:', 'mate', 'KLO', 'Market', 'BIZ', 'KP', 'who', 'Documents', 'REF.doc']]

Now L contains a list of path parts, so you can get each file name with its extension by taking the last element of every list:

>>> L_names = [path_parts[-1] for path_parts in L if path_parts[-1].endswith('.doc')]
>>> L_names
['file4.doc', 'list.doc', 'file4g.doc', 'file4r.doc', 'file4k.doc', 'ole.doc', 'os_references(9).doc', 'REF.doc']
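
The names above still carry their extension. A sketch of the remaining step follows, in Python 3 syntax. It assumes the paths are given as raw strings, so no `raw()` helper is needed; the third path is made up to show the filter, and `last_parts` and `stems` are illustrative names.

```python
import re

paths = [r'C:\Design\dw\file4.doc',
         r'C:\Design/Dq/test1.doc',
         r'C:\GE\easy\file\notes.txt']   # made-up non-.doc path

# Split on either separator and keep the last component.
last_parts = [re.split(r'[/\\]', path)[-1] for path in paths]

# Keep only .doc files and slice the extension off the end.
stems = [name[:-len('.doc')] for name in last_parts
         if name.endswith('.doc')]

print(stems)  # ['file4', 'test1']
```

Slicing from the end removes only the trailing extension, unlike `replace`.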

Downvoting because Python has a dedicated standard tool just for this. — Eric O. Lebigot, Oct 28 '14 at 07:25
The problem here is to get the raw representation of the string, you have this mistake with \f. And the question needs the filename, and your answer doesn't provide them — xecgr, Oct 28 '14 at 07:33
You are right to point out the problem of the string input. I believe that the original question should contain *raw* strings, though, and I understand that the poster mostly wants to know how to get the file names (with no extension). — Eric O. Lebigot, Oct 28 '14 at 08:35

Eric O. Lebigot · Answer 6 · 2014-10-28T08:36:34.963

-3

The first important point is that you should write your list with raw strings (r prefix):

L = [r'C:\Design\dw\file4.doc',
     r'C:\light\PDF\downloads\list.doc',
     …]

Otherwise, backslash escape sequences in your file names are interpreted (a sequence such as \f is replaced by a single character).

Python 2 has a dedicated sub-module just for manipulating paths, which gives you the expected result:

from os.path import basename, splitext
print [splitext(basename(path))[0] for path in L]

Note that this script must run on a system that uses the same path separator convention (/ or \) as the system the paths come from (which should usually be the case, as paths generally make sense locally on a machine). You can make it work specifically for Windows paths (on any operating system) by doing instead:

from ntpath import basename, splitext

You then get, on any machine:

['file4', 'list', 'file4g', 'file4r', 'file4k', 'ole', 'os_references(9)', 'REF']
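
The difference between the two conventions can be seen by calling both implementations explicitly. This is a sketch in Python 3 syntax; `os.path` is `ntpath` on Windows and `posixpath` on Unix-like systems, and the variable `path` is illustrative.

```python
import ntpath
import posixpath

path = r'C:\Design\Dq\ole.doc'

# ntpath always applies Windows rules, whatever the host system is.
print(ntpath.basename(path))      # ole.doc

# posixpath applies Unix rules: '\' is an ordinary character there,
# so the whole string counts as a single file name.
print(posixpath.basename(path))   # C:\Design\Dq\ole.doc
```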

edited Oct 28 '14 at 08:36

answered Oct 28 '14 at 07:24

Eric O. Lebigot

What is the problem? what do you get? – Eric O. Lebigot Oct 28 '14 at 07:38
It looks like you are using a Unix machine for analyzing Windows paths. I added a caveat: paths usually make sense *on a given machine*, so this script works when run on a machine with the same path separator convention as the machine where the paths are from. – Eric O. Lebigot Oct 28 '14 at 07:48
@AvinashRaj: You should try my amended, current solution. – Eric O. Lebigot Oct 28 '14 at 08:36
@AvinashRaj: It looks like you forgot to use the raw string marker (`r`) that the original poster likely forgot, as per my answer. – Eric O. Lebigot Oct 28 '14 at 13:43
