1

I need to split the document path to the foldername and the document name in python. It is a large dataframe including many rows.For the filename with no document name followed, just leave the document name column blank in the result. For example, I have a dataframe like the follows:

     no  filename
     1  \\apple\config.csv
     2  \\apple\fox.pdf
     3  \\orange\cat.xls
     4  \\banana\eggplant.pdf
     5  \\lucy
...

I expect the output shown as follows:

    foldername  documentname
    \\apple     config.csv
    \\apple     fox.pdf
    \\orange    cat.xls
    \\banana    eggplant.pdf
    \\lucy 
...     

I have tried the following code,but it does not work.


    y={'Foldername':[],'Docname':[]}
    def splitnames(x):
        if "." in x:
            docname=os.path.basename(x)
            rm="\\"+docname
            newur=x.replace(rm,'')
        else:
            newur=x
            docname=""
        result=[newur,docname]
        y["Foldername"].append(result[0])
        y["Docname"].append(result[1])
        return y;

    dff=df$filename.apply(splitnames)

Thank you so much for the help!!

Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
Cherry
  • 73
  • 6
  • 1
    Can you not split the entire filename string on the backslash (yields a list) and take the last item from the list? Like so: string.split('\')[0]. This would give you 'config.csv' (in the first example). – sander May 07 '20 at 10:53
  • 1
    See answer in this question: https://stackoverflow.com/questions/7304117/split-filenames-with-python `import os; os.path.basename("/path/to/the/filename.csv")` -> `filename.csv` – Tin Nguyen May 07 '20 at 10:55
  • Thanks for the comments! Unfortunately I do need the foldername corresponding to the document name. Is there any way to do that?? – Cherry May 07 '20 at 10:55

3 Answers3

0

Possibly, you shall use apply function twice, to generate separate columns:

import pandas as pd
filenames = [r'\\apple\config.csv', r'\\apple\fox.pdf', r'\\orange\cat.xls', r'\\banana\eggplant.pdf']
df = pd.DataFrame( { 'filename':filenames })
df['Foldername'] = df['filename'].apply( lambda x : r'\\' + x.split('\\')[-2]  )
df['Docname'] = df['filename'].apply( lambda x :  x.split('\\')[-1]  )

Default apply function awaits single value to be created and also in this case it is worth to indicate to which column you want to use it.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html

RunTheGauntlet
  • 392
  • 1
  • 4
  • 15
  • 1
    it would be better to use `str.split` here no need to take the performance hit with apply. – Umar.H May 07 '20 at 12:04
  • The problem is I have a large number of pathnames in my original data, so I may not be able to type in every file name into the dataframe "filenames" , they are stored in a csv file. – Cherry May 07 '20 at 16:53
  • And I some filename only includes foldername without documentname followed, in that case I need to copy the foldername to the foldername column and leave the documentname blank as shown in the example above. – Cherry May 07 '20 at 16:59
0

Not sure how you're getting the paths, but you could create some Pathlib objects and use some class methods to grab the file name and folder name.

:

from pathlib import Path

data = """ no  filename
     1  \\apple\\config.csv
     2  \\apple\\fox.pdf
     3  \\orange\\cat.xls
     4  \\banana\\eggplant.pdf
     5  \\lucy"""

df = pd.read_csv(StringIO(data),sep='\s+')
df['filename'] = df['filename'].apply(Path)


df['folder'] = df['filename'].apply(lambda x : x.parent if '.' in x.suffix else x)
df['document_name'] = df['filename'].apply(lambda x : x.name if '.' in x.suffix  else np.nan)


print(df)

   no              filename   folder document_name
0   1     \apple\config.csv   \apple    config.csv
1   2        \apple\fox.pdf   \apple       fox.pdf
2   3       \orange\cat.xls  \orange       cat.xls
3   4  \banana\eggplant.pdf  \banana  eggplant.pdf
4   5                 \lucy    \lucy           NaN
Umar.H
  • 22,559
  • 7
  • 39
  • 74
  • Thank you. The problem is I have large amount of filenames that need to be read from a csv file, some of them are only folder name with no document name followed. In that case, I would like to grap the foldername and leave the documentname blank. That's why I had set a if condition (if ". " in x) to decide whether this string should be splitted into two parts. I have updated the example as above. Could you share any advice how could I resolve that problem? Thank you in advance! – Cherry May 07 '20 at 16:58
  • Is there anyway to use Pathlib to deal with the filename column with both path and path followed by document name, for the path has the extension name grab the foldername and document name, but for the path without extension name just get the foldername and leave the docuement name blank? – Cherry May 12 '20 at 09:00
  • For example, if the filename is \apple\config.csv, the foldername output is \apple and the documentname output is config,csv. But if the filename is \lucy without any documentname followed, in the output the foldername should be \lucy, and documentname should be left blank, as shown in my example posted earlier. Therefore, I need to first make a judgement whether there is a document name. – Cherry May 14 '20 at 09:31
  • Thanks a lot!! I read the filename document from a csv file, when I try to run your code, I got this error: 'str' object has no attribute 'suffix'. Shall I do any data type transfer before running this code? Thank you!! – Cherry May 15 '20 at 09:17
  • @Cherry we first need to create a pathlib object `df['filename'] = df['filename'].apply(Path)` – Umar.H May 15 '20 at 09:20
  • @Cherry we got there eventually :) happy coding. – Umar.H May 15 '20 at 09:28
0

Extension to Umar.H suggestion is to use split under the os lib

df['Docname'] = df['filename'].apply(lambda x : os.path.split(x)[1])
mpx
  • 3,081
  • 2
  • 26
  • 56