
I have a list that prints as follows:

['/dbfs/mnt/abc/date=20210225/fsp_store_abcxyz_lmn_', '/dbfs/mnt/abc/date=20210225/fsp_store_schu_lev_bsd_s_']

The required output is:

fsp_store_abcxyz_lmn_
fsp_store_schu_lev_bsd_s_

Could you please help me get this output from the list?

Mykola Zotko
batman_special
  • `[i.split('/')[-1] for i in lst]`? – Mykola Zotko Feb 25 '21 at 08:18
  • @MykolaZotko Yes, that answers my question, thank you so much. Could you also help me include the source in the output, like this: Source Extract /dbfs/mnt/abc/date=20210225/fsp_store_abcxyz_lmn_ fsp_store_schu_lev_bsd_s_ Extracted fsp_store_abcxyz_lmn_ fsp_store_schu_lev_bsd_s_ – batman_special Feb 25 '21 at 10:20
  • @MykolaZotko Can we get the source and the extracted output in two separate columns, in tabular format, please? – batman_special Feb 25 '21 at 11:11
  • @batman_special Updated [my answer](https://stackoverflow.com/a/66365139/941531) to output in tabular form. – Arty Feb 25 '21 at 11:16
  • @MykolaZotko Could you please suggest what needs to be done for such a requirement? – batman_special Feb 25 '21 at 12:09
  • @batman_special I would create a new question with an example dataframe and desired output dataframe. – Mykola Zotko Feb 25 '21 at 12:56
  • @MykolaZotko We already solved this question, we chatted for a long time and finally got desired result. batman_special just forgot to Accept my answer as correct. – Arty Feb 25 '21 at 12:58
  • @MykolaZotko The main problem was to apply my simple solution to `pyspark`, it has a special implementation of all Python functions on Spark platform. So it wasn't so easy to get correct results without extra stuff. – Arty Feb 25 '21 at 12:59
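
For reference, the one-liner from the first comment can be run as a small script. This is only a minimal sketch: the names `lst`, `names`, and `trailing` are illustrative, and the trailing-slash input is a hypothetical case that does not occur in the question's data.

```python
# Sample list taken from the question.
lst = [
    '/dbfs/mnt/abc/date=20210225/fsp_store_abcxyz_lmn_',
    '/dbfs/mnt/abc/date=20210225/fsp_store_schu_lev_bsd_s_',
]

# Split each path on '/' and keep the last component.
names = [i.split('/')[-1] for i in lst]
print('\n'.join(names))

# Caveat (hypothetical input): a trailing '/' makes the last component empty.
trailing = '/dbfs/mnt/abc/date=20210225/'.split('/')[-1]
print(repr(trailing))
```

If trailing slashes can occur in the real data, stripping them first with `rstrip('/')` is one way to avoid the empty result.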

2 Answers


Here is an example that solves your task using str.rpartition(). I had to reimplement the built-ins max() and str.ljust() as Max() and LJust() because you are running pyspark, where these built-ins did not behave as they do in plain Python.

After running the code, you can use res2 or res3 further in your own code: res2 contains all rows in the format [source, extracted], and res3 contains only the extracted values.


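# Stand-in for the built-in max(): returns the largest element, or None if l is empty.
def Max(l):
    m = None
    for e in l:
        if m is None or e > m:
            m = e
    return m

# Stand-in for str.ljust(): pads s with spaces on the right up to width n.
def LJust(s, n):
    return s if len(s) >= n else s + ' ' * (n - len(s))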

l = [
    '/dbfs/mnt/abc/date=20210225/fsp_store_abcxyz_lmn_',
    '/dbfs/mnt/abc/date=20210225/fsp_store_schu_lev_bsd_s_',
]
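# Keep only the part after the last '/'.
res = [e.rpartition('/')[-1] for e in l]
# Pair each source path with its extracted name: [source, extracted].
res2 = [[e0, e1] for e0, e1 in zip(l, res)]
# The first column is as wide as the longest source path.
maxl = Max([len(e) for e in l])
print(LJust('Source', maxl) + '    Extracted')
print('\n'.join([LJust(s, maxl) + '    ' + d for s, d in res2]))
# Extracted values only.
res3 = [e1 for e0, e1 in res2]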

Output:

Source                                                   Extracted
/dbfs/mnt/abc/date=20210225/fsp_store_abcxyz_lmn_        fsp_store_abcxyz_lmn_
/dbfs/mnt/abc/date=20210225/fsp_store_schu_lev_bsd_s_    fsp_store_schu_lev_bsd_s_
Arty
  • Can we also get the source we are passing in, alongside the corresponding extracted output? – batman_special Feb 25 '21 at 10:22
  • @batman_special Don't understand what you mean. You want me to make util function out of my code? – Arty Feb 25 '21 at 10:35
  • I mean that the output should also include the source. – batman_special Feb 25 '21 at 10:36
  • @batman_special I modified my answer to show source and extracted version. Is it what you want? – Arty Feb 25 '21 at 10:39
  • Yes @Arty, that works. Can it also be kept in tabular format, with a source column and an extracted column? – batman_special Feb 25 '21 at 10:55
  • @batman_special Updated my answer to output in tabular form. – Arty Feb 25 '21 at 10:59
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/229197/discussion-between-batman-special-and-arty). – batman_special Feb 25 '21 at 11:21
  • I am using `from pyspark.sql.types import StringType`, `val = [i.split('/')[-1] for i in list]`, and `df = spark.createDataFrame(val, StringType()).display()`, and I get only fsp_store_abcxyz_lmn_ and fsp_store_schu_lev_bsd_s_. What I need is the corresponding source file alongside the extracted file, in tabular format. – batman_special Feb 25 '21 at 11:34
  • @batman_special The Accept and UpVote buttons are located at the top left of my answer. Accept looks like a check mark; UpVote looks like an up arrow. It would be nice if you clicked both, so that I get some score for my work. As we figured out, your task is now solved correctly by my code on pyspark. Look [at image](https://i.stack.imgur.com/rbfH3.png) to see where these buttons are located; I put a red box around them. – Arty Feb 25 '21 at 12:41
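
The two-column requirement discussed in these comments can be sketched in plain Python by pairing each source path with its extracted name. This is a minimal illustration, not the code from the chat; the names `paths` and `rows` are assumptions introduced here.

```python
# Sample list taken from the question.
paths = [
    '/dbfs/mnt/abc/date=20210225/fsp_store_abcxyz_lmn_',
    '/dbfs/mnt/abc/date=20210225/fsp_store_schu_lev_bsd_s_',
]

# rpartition('/') returns (head, '/', tail); the tail is the extracted name.
# Each row is a (source, extracted) tuple.
rows = [(p, p.rpartition('/')[-1]) for p in paths]

# Print the two columns separated by a tab.
for source, extracted in rows:
    print(source, extracted, sep='\t')
```

With an active `spark` session like the one in the comment above, a list of tuples such as `rows` could likely be passed to `spark.createDataFrame(rows, ['Source', 'Extracted'])` to get the two columns. That call is not run here and the column names are illustrative.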

Assuming that your list items are paths:

import os

paths = [
  '/dbfs/mnt/abc/date=20210225/fsp_store_abcxyz_lmn_', 
  '/dbfs/mnt/abc/date=20210225/fsp_store_schu_lev_bsd_s_'
]

for path in paths:
    print(os.path.basename(path))

Output:

fsp_store_abcxyz_lmn_
fsp_store_schu_lev_bsd_s_

os.path.basename: https://docs.python.org/3/library/os.path.html#os.path.basename

This function is convenient because you don't have to specify the path separator yourself: it uses the separator of the current platform (`/` on POSIX; `\` or `/` on Windows).
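
One caveat worth checking: `os.path` is an alias for either `posixpath` or `ntpath`, depending on the platform, so the separators it recognizes depend on where the code runs. The sketch below calls both modules directly to show the difference; the Windows-style path is a hypothetical example, not data from the question.

```python
import ntpath     # what os.path resolves to on Windows
import posixpath  # what os.path resolves to on Linux/macOS

posix_style = '/dbfs/mnt/abc/date=20210225/fsp_store_abcxyz_lmn_'
windows_style = r'C:\data\date=20210225\fsp_store_abcxyz_lmn_'  # hypothetical path

# POSIX rules: only '/' is a separator, so a backslash path stays whole.
posix_on_posix = posixpath.basename(posix_style)
windows_on_posix = posixpath.basename(windows_style)

# Windows rules: both '\' and '/' are treated as separators.
windows_on_nt = ntpath.basename(windows_style)
posix_on_nt = ntpath.basename(posix_style)

print(posix_on_posix, windows_on_posix, windows_on_nt, posix_on_nt, sep='\n')
```

For the `/dbfs/...` paths in the question, both sets of rules give the same result, so this only matters if backslash-separated paths can appear.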

alexzander
  • Better to use [`pathlib`](https://docs.python.org/3/library/pathlib.html#pathlib.Path) for this: `print(Path(path).name)` – Tomerikoo Feb 25 '21 at 08:58
  • There is no such thing as a `best answer` for this question. Everyone will use what they are most accustomed to: some will use `str.split`, some `os.path`, some `pathlib`. It depends on perspective and choice. – alexzander Feb 25 '21 at 09:23
  • Allow me to disagree. It's not just a matter of taste. Using string methods to work with paths is simply not ideal and can lead to problems. At its core, `os.path` treats paths as simple strings. – Tomerikoo Feb 25 '21 at 09:56
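
The `pathlib` suggestion from the comments could look roughly like the sketch below. Note one deliberate swap: it uses `PurePosixPath` instead of the `Path` named in the comment, on the assumption that the `/dbfs/...` strings should always be read with `/` as the separator, whatever OS the code runs on. The names `paths` and `names` are illustrative.

```python
from pathlib import PurePosixPath

# Sample list taken from the question.
paths = [
    '/dbfs/mnt/abc/date=20210225/fsp_store_abcxyz_lmn_',
    '/dbfs/mnt/abc/date=20210225/fsp_store_schu_lev_bsd_s_',
]

# .name is the final path component; PurePosixPath never touches the filesystem.
names = [PurePosixPath(p).name for p in paths]

for name in names:
    print(name)
```

Unlike `split('/')[-1]`, `.name` ignores a trailing slash, so `PurePosixPath('/a/b/').name` gives `'b'` rather than an empty string.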