I have a list of filenames in a txt file and I am trying to sort them. The first thing I am trying to do is split them into a list, since they are all on a single line. There are three file types in the list. I am able to split the list, but I would like to keep the delimiters in the end result, and I have not been able to find a way to do this. This is how I am splitting the filenames:

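import re

def breakLines():
    # Read the whole file; all filenames sit on a single line.
    with open("index.txt", "rt") as file_obj:
        file_str = file_obj.read()

    # Splits on the extensions, but the extensions themselves are dropped.
    unsorted_list = re.split('.txt|.mpd|.mp4', file_str)

    print(unsorted_list)

breakLines()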

I found DeepSpace's answer to the question 'Split a string with "(" and ")" and keep the delimiters (Python)' very helpful, but it only seems to work with single-character delimiters.

EDIT:

Sample input:

file_name1234.mp4file_name1235.mp4file_name1236.mp4file_name1237.mp4

Expected output:

file_name1234.mp4

file_name1235.mp4

file_name1236.mp4

file_name1237.mp4

Alexiz Hernandez

1 Answer

With re.split, the key is to wrap the split pattern in a capturing group (parentheses) so that the delimiters are kept in the result. Your attempt is:

>>> s = "file_name1234.mp4file_name1235.mp4file_name1236.mp4file_name1237.mp4"
>>> re.split('.txt|.mpd|.mp4', s)
['file_name1234', 'file_name1235', 'file_name1236', 'file_name1237', '']

That split discards the extensions (and the dots should be escaped so that they match a literal dot rather than any character), so let's try:

>>> re.split(r'(\.txt|\.mpd|\.mp4)', s)
['file_name1234',
 '.mp4',
 'file_name1235',
 '.mp4',
 'file_name1236',
 '.mp4',
 'file_name1237',
 '.mp4',
 '']

This works, but it splits the extensions from the filenames and leaves an empty string at the end, which is not what you want (unless you are willing to do some ugly post-processing). Plus, this is a duplicate of the question "In Python, how do I split a string and keep the separators?"
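For completeness, here is a minimal sketch of what that post-processing could look like. It assumes the input always alternates name, extension, name, extension, and it simply glues each pair back together; the variable names are only illustrative.

```python
import re

s = "file_name1234.mp4file_name1235.mp4file_name1236.mp4file_name1237.mp4"

# The capturing group keeps the extensions as separate list items.
parts = re.split(r'(\.txt|\.mpd|\.mp4)', s)

# Even indexes hold the names, odd indexes hold the extensions.
# zip() stops at the shorter slice, which drops the trailing empty string.
names = [name + ext for name, ext in zip(parts[0::2], parts[1::2])]

print(names)
```

It gives the expected filenames for the sample input, but it is more fragile than matching the filenames directly.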

But you don't want re.split; you want re.findall:

>>> s = "file_name1234.mp4file_name1235.mp4file_name1236.mp4file_name1237.mp4"
>>> re.findall(r'(\w*?(?:\.txt|\.mpd|\.mp4))',s)
['file_name1234.mp4',
 'file_name1235.mp4',
 'file_name1236.mp4',
 'file_name1237.mp4']

The expression matches as few word characters as possible (basically digits, letters and underscores), followed by one of the extensions. To express the OR between the extensions without capturing it separately, I created a non-capturing group inside the main group.
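Applied to the function from the question, this could be sketched roughly as follows. The helper names split_filenames and read_filenames are hypothetical, and the outer capturing group is dropped here because re.findall returns the whole match when the pattern has no capturing groups.

```python
import re

# Hypothetical helper names; the pattern is the one used above.
FILENAME_RE = re.compile(r'\w*?(?:\.txt|\.mpd|\.mp4)')

def split_filenames(text):
    """Return every run of word characters that ends in a known extension."""
    return FILENAME_RE.findall(text)

def read_filenames(path="index.txt"):
    """Read the whole file and split its contents into filenames."""
    with open(path, "rt") as file_obj:
        return split_filenames(file_obj.read())
```

Calling sorted(read_filenames()) would then give a sorted list, assuming index.txt exists in the working directory.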

If you have more exotic filenames, you can't use \w anymore, but replacing it with . still works reasonably well (you may need some str.strip post-processing to remove leading/trailing blanks, which are likely not part of the filenames):

>>> s = " file name1234.mp4file-name1235.mp4 file_name1236.mp4file_name1237.mp4"
>>> re.findall(r'(.*?(?:\.txt|\.mpd|\.mp4))',s)
[' file name1234.mp4',
 'file-name1235.mp4',
 ' file_name1236.mp4',
 'file_name1237.mp4']
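The str.strip post-processing mentioned above could look like the sketch below. It assumes that leading and trailing whitespace is never part of a filename, and that the names contain no newlines (by default . does not match a newline).

```python
import re

s = " file name1234.mp4file-name1235.mp4 file_name1236.mp4file_name1237.mp4"

# Match any characters, as few as possible, up to a known extension.
raw = re.findall(r'.*?(?:\.txt|\.mpd|\.mp4)', s)

# Remove the blanks that surround each match.
cleaned = [name.strip() for name in raw]

print(cleaned)
```

Spaces inside a name, as in "file name1234.mp4", are left untouched; only the surrounding blanks are removed.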

So sometimes you think re.split when you need re.findall, and the reverse is also true.

Jean-François Fabre
  • This is great! Thank you so much! Just one question, if in the future I were to have spaces or other symbols in the file names, would this be an issue or would it work the same? – Alexiz Hernandez Sep 11 '18 at 20:24
  • That'll work if you accept all the characters: `'(.*?(?:\.txt|\.mpd|\.mp4))'`. You may want to apply a `strip()` if needed. – Jean-François Fabre Sep 11 '18 at 20:29