How to split a mixed string with numbers

Question

I have data in a text file that contains "Test DATA_g004, Test DATA_g003, Test DATA_g001, Test DATA_g002".

Is it possible to sort it while ignoring the prefix "Test DATA_", so that the data is ordered g001, g002, g003, and so on?

I tried the .split("Test DATA_") method, but it doesn't work.

import re

def readFile():
    # The try block runs if the text file is found
    try:
        with open("test.txt", 'r') as fileName:
            data = fileName.read().split("\n")
        data.sort(key=alphaNum_Key)  # alternative sort function
        print(data)
    # The except block runs if the text file is not found
    except IOError:
        print("Error: File does not exist")
        return

# Human sorting: turn a run of digits into an int, leave other text unchanged
def alphaNum(text):
    return int(text) if text.isdigit() else text

# Human sorting: split into digit and non-digit chunks to build the sort key
def alphaNum_Key(text):
    return [alphaNum(c) for c in re.split(r'(\d+)', text)]

score 7 · Accepted Answer · answered Dec 30 '15 at 17:26

You can do this using re.

import re
x = "Test DATA_g004, Test DATA_g003, Test DATA_g001, Test DATA_g002"
# Sort on the integer that follows "_g" at the end of each item
print(sorted(x.split(","), key=lambda k: int(re.findall(r"(?<=_g)\d+$", k)[0])))

Output: [' Test DATA_g001', ' Test DATA_g002', ' Test DATA_g003', 'Test DATA_g004']
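
If the prefix should also be dropped from the result, one possible sketch is below. It assumes every item carries the literal prefix "Test DATA_"; the helper name strip_prefix is hypothetical and only used for illustration.

```python
import re

x = "Test DATA_g004, Test DATA_g003, Test DATA_g001, Test DATA_g002"

def strip_prefix(item, prefix="Test DATA_"):
    """Hypothetical helper: trim whitespace, then drop the fixed prefix if present."""
    item = item.strip()
    return item[len(prefix):] if item.startswith(prefix) else item

# Strip first, then sort on the trailing integer of each label
labels = [strip_prefix(part) for part in x.split(",")]
labels.sort(key=lambda label: int(re.search(r"\d+$", label).group()))
print(labels)  # ['g001', 'g002', 'g003', 'g004']
```

Stripping before sorting also removes the leading spaces left behind by splitting on ",".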

vks

The sorting function works fine. However, I am having trouble sorting just the "g001" part. Basically, how do I sort the data without the string "Test DATA_"? – Aurora_Titanium Dec 30 '15 at 17:28
@Aurora_Titanium (x.replace('Test DATA_', '') for x in xs) – Caridorc Dec 30 '15 at 17:30
@Aurora_Titanium I have sorted using the integer at the end, after `_g`, as the key – vks Dec 30 '15 at 17:30
Sorry! I totally forgot about it. I checked back my old project and found out that I used your solution :) The python community is way better than the C or Java community. They downvote everything! – Aurora_Titanium Jan 03 '17 at 14:30

Iron Fist · Answer 2 · 2015-12-30T17:40:29.227

Retrieve all substrings that consist of g followed by digits, then sort the list with sorted:

>>> import re
>>> s = "Test DATA_g004, Test DATA_g003, Test DATA_g001, Test DATA_g002, "
>>> sorted(re.findall(r'g\d+', s))
['g001', 'g002', 'g003', 'g004']

Another way is to use only built-in methods:

>>> l = [x.split('_')[1] for x in s.split(', ') if x]
>>> l
['g004', 'g003', 'g001', 'g002']
>>> l.sort()
>>> l
['g001', 'g002', 'g003', 'g004']
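
Note that a plain string sort gives the expected order here only because the numbers are zero-padded to the same width. A small sketch of the difference, using hypothetical labels that are not padded:

```python
import re

labels = ["g10", "g9", "g100", "g1"]  # hypothetical data without zero-padding

# Plain string sort compares character by character
lexicographic = sorted(labels)

# Sorting on the integer value of the digits gives numeric order
numeric = sorted(labels, key=lambda label: int(re.search(r"\d+", label).group()))

print(lexicographic)  # ['g1', 'g10', 'g100', 'g9']
print(numeric)        # ['g1', 'g9', 'g10', 'g100']
```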

edited Dec 30 '15 at 17:40

answered Dec 30 '15 at 17:31

Very nice solution. Elegant and clean. – erip Dec 30 '15 at 17:49

score 3 · Answer 3 · edited Jun 20 '20 at 09:12

Yes, you can. You can sort by the last 3 digits in each test substring:

# The string to be sorted by digits
s = "Test DATA_g004, Test DATA_g003, Test DATA_g001, Test DATA_g002"

# Split at commas, then sort using the last 3 characters of each element as an `int`.
l = sorted(s.split(','), key = lambda x: int(x[-3:]))

print(l)
# [' Test DATA_g001', ' Test DATA_g002', ' Test DATA_g003', 'Test DATA_g004']

You may want to strip the whitespace from the elements of l if that's important to you, but this will work for all items that end in 3 digits.

If you don't want Test DATA_, you can do this:

# The string to be sorted by digits
s = "Test DATA_g004, Test DATA_g003, Test DATA_g001, Test DATA_g002"

# Keep only the last 4 characters of each element, then sort using the last 3 characters as an `int`.
l = sorted((x[-4:] for x in s.split(',')), key = lambda x: int(x[-3:]))

print(l)
# ['g001', 'g002', 'g003', 'g004']

If your data is well-formed (i.e., g followed by 3 digits), this will work quite well. Otherwise, use a regex from any of the other posted answers.

Another alternative is to push strings into a PriorityQueue as you read them:

test.py

from queue import PriorityQueue  # on Python 2: from Queue import PriorityQueue

q = PriorityQueue()

with open("example.txt") as f:
  # For each line in the file
  for line in f:
    # Create a list from the stripped, split-at-comma string
    for s in line.strip().split(','):
      # Push the last four characters of each element in the list into the pq
      q.put(s[-4:])

while not q.empty():
  print(q.get())

The benefit of using a PQ is that it hands the items back in sorted order, which takes the burden off of you. Each insertion and removal costs O(log n), so the whole pass is O(n log n), the same as sorting a list.

example.txt

Test DATA_g004, Test DATA_g003, Test DATA_g001, Test DATA_g002

And the output:

13:25 $ python test.py 
g001
g002
g003
g004
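
For comparison, roughly the same idea can be sketched with the standard-library heapq module, which provides the binary heap operations without the thread-safety wrapper. This version reads from an in-memory string instead of example.txt so that it is self-contained; that substitution is an assumption made for illustration.

```python
import heapq

line = "Test DATA_g004, Test DATA_g003, Test DATA_g001, Test DATA_g002"

heap = []
for item in line.strip().split(","):
    # Each push costs O(log n); keep only the last four characters, e.g. "g004"
    heapq.heappush(heap, item.strip()[-4:])

# Popping repeatedly yields the smallest remaining label each time
ordered = [heapq.heappop(heap) for _ in range(len(heap))]
print(ordered)  # ['g001', 'g002', 'g003', 'g004']
```

For a one-off sort of a list that is already in memory, sorted() is usually simpler; a heap is mostly useful when items arrive incrementally.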

I appreciate your use of slicing over `re` for such simple and normal-looking data. I think it makes the answer, and what the OP was missing, clearer. — Zach Young, Dec 30 '15 at 17:35

score 2 · Answer 4 · edited May 23 '17 at 12:16

Sounds like you want "natural sorting". The following, copied from https://stackoverflow.com/a/4836734/3019689, might do it.

import re

def natural_sort(l): 
    convert = lambda text: int(text) if text.isdigit() else text.lower() 
    alphanum_key = lambda key: [ convert(c) for c in re.split('([0-9]+)', key) ] 
    return sorted(l, key = alphanum_key)

However, you keep saying you want to sort "without the Test DATA_", which suggests to me you're not telling the whole story. If it was literally Test DATA_ every time, it would not affect the sort: sort with or without it; it wouldn't matter. I bet you're really worried about the fact that this string prefix actually varies from filename to filename, and you want to ignore it completely, whatever it is, and focus only on the numeric part. If this is the case, you can substitute `else None` for `else text.lower()` in the above listing.
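
A minimal sketch of that substitution, written as a standalone key function. The name numeric_only_key and the sample names with varying prefixes are hypothetical, chosen only to show the effect.

```python
import re

def numeric_only_key(name):
    # Digit runs become ints; every non-digit chunk becomes None,
    # so the text of the prefix never influences the order.
    return [int(chunk) if chunk.isdigit() else None
            for chunk in re.split(r'([0-9]+)', name)]

names = ["Beta_g004", "Alpha_g003", "Gamma_g001", "Delta_g002"]  # hypothetical
result = sorted(names, key=numeric_only_key)
print(result)  # ['Gamma_g001', 'Delta_g002', 'Alpha_g003', 'Beta_g004']
```

Because re.split with a single capturing group always alternates non-digit and digit chunks, a None is only ever compared with another None, so this also runs on Python 3. Names whose numbers are all equal keep their input order, since sorted is stable.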

score 0 · Answer 5 · answered Apr 24 '19 at 16:23

import re

def natural_sort(l): 
    convert = lambda text: int(text) if text.isdigit() else text.lower() 
    alphanum_key = lambda key: [ convert(c) for c in re.split(r'(\d+)', key) ]
    return sorted(l, key = alphanum_key)

This code snippet should work fine. This kind of ordering is called natural sorting; it is commonly used for alphanumeric strings, so that embedded numbers compare by value rather than character by character.
