How to find unique starts of strings?

Question

If I have a list of strings (e.g. 'blah 1', 'blah 2', 'xyz fg', 'xyz penguin'), what would be the best way of finding the unique starts of the strings ('xyz' and 'blah' in this case)? The starts of strings can be multiple words.

Because *the starts of strings can be multiple words*, then "blah 2" is a string start with count 1. Similarly, "b" is a string start with count 2. You have to be more precise about your definition of a string start. — Ewan Todd, Nov 19 '09 at 14:11
You cannot answer this question unless there is some condition for ending the start portion. How do you determine when a multi-word start portion ends? — Michael Dillon, Nov 19 '09 at 14:25
to clarify - what would you want the output to be if the list also included 'blat'? 'bla' and 'blah '? just 'bla'? just 'blah '? if the first why not also 'b', 'bl' and 'bla'? as others have said you need to tighten up your requirements — robince, Nov 19 '09 at 14:57

score 4 · Answer 1 · edited May 23 '17 at 12:19

Your question is confusing, as it is not clear what you really want. So I'll give three answers and hope that one of them at least partially answers your question.

To get all unique prefixes of a given list of strings, you can do:

>>> l = ['blah 1', 'blah 2', 'xyz fg', 'xyz penguin']
>>> set(s[:i] for s in l for i in range(len(s) + 1))
{'', 'xyz pe', 'xyz penguin', 'b', 'xyz fg', 'xyz peng', 'xyz pengui', 'bl', 'blah 2', 'blah 1', 'blah', 'xyz f', 'xy', 'xyz pengu', 'xyz p', 'x', 'blah ', 'xyz pen', 'bla', 'xyz', 'xyz '}

This code generates all initial slices of every string in the list and passes these to a set to remove duplicates.
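If the goal is to narrow this down to prefixes that more than one string shares (the "count" that Ewan Todd's comment refers to), one possible sketch is to count the initial slices instead of only collecting them. The helper name `prefix_counts` and the "shared by at least two strings" threshold are assumptions made for illustration, not something the question specifies.

```python
from collections import Counter

def prefix_counts(strings):
    """Count how many strings in the list start with each prefix."""
    counts = Counter()
    for s in strings:
        # Every initial slice of s, from '' up to s itself.
        for i in range(len(s) + 1):
            counts[s[:i]] += 1
    return counts

l = ['blah 1', 'blah 2', 'xyz fg', 'xyz penguin']
counts = prefix_counts(l)

# Assumed filter: non-empty prefixes shared by at least two strings.
shared = {p for p, n in counts.items() if p and n > 1}
```

With this input, `shared` still contains 'b', 'bl', 'bla' and so on alongside 'blah' and 'xyz', which is why a precise rule for where a "start" ends is needed.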

To get, for each string, the longest initial word sequence that is shorter than the full string, you could go with:
```
>>> l = ['a b', 'a c', 'a b c', 'b c']
>>> set(s.rsplit(' ', 1)[0] for s in l)
{'a', 'a b', 'b'}
```
This code creates a set by splitting each string at its rightmost space, if there is one (otherwise the whole string is returned).

On the other hand, to get all unique initial word sequences without considering full strings, you could go for:
```
>>> l = ['a b', 'a c', 'a b c', 'b c']
>>> set(' '.join(w[:i]) for s in l for w in (s.split(),) for i in range(len(w)))
{'', 'a', 'b', 'a b'}
```
This code splits each string at any whitespace and joins all initial slices of the resulting word list, except the longest one. This code has a pitfall: because the words are rejoined with single spaces, it will, for example, convert tabs to spaces and collapse runs of whitespace. This may or may not be an issue in your case.
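If the original whitespace matters, a possible variant is to slice the original string at each whitespace run between two words rather than splitting and rejoining. This is only a sketch: the helper name `initial_word_sequences` is made up for illustration, and unlike the snippet above it does not include the empty string and it keeps any leading whitespace as part of the prefix.

```python
import re

def initial_word_sequences(s):
    """Initial word sequences of s (excluding s itself), with whitespace kept as-is."""
    # Each whitespace run between two non-whitespace characters ends one sequence.
    return {s[:m.start()] for m in re.finditer(r'(?<=\S)\s+(?=\S)', s)}

l = ['a b', 'a c', 'a\tb c', 'b c']
# Union of the per-string sets; the tab in 'a\tb' is preserved.
result = set().union(*(initial_word_sequences(s) for s in l))
```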

+1 to make up for the mystery downvote (seems like a good answer to a confused question) — robince, Nov 19 '09 at 14:58

score 2 · Answer 2 · edited Nov 19 '09 at 14:38


If you mean unique first words of strings (words being separated by space), this would be:

arr=['blah 1', 'blah 2', 'xyz fg', 'xyz penguin']
unique=list(set([x.split(' ')[0] for x in arr]))
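A set does not keep the order in which the first words appear. If that order matters, one possible alternative (a sketch, assuming Python 3.7+ where dictionaries preserve insertion order) is to deduplicate with `dict.fromkeys`:

```python
arr = ['blah 1', 'blah 2', 'xyz fg', 'xyz penguin']

# Take the text before the first space of each string,
# then drop duplicates while keeping first-seen order.
unique = list(dict.fromkeys(x.split(' ', 1)[0] for x in arr))
```

Here `unique` is `['blah', 'xyz']`, in the order the first words occur in `arr`.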

answered Nov 19 '09 at 14:20 by yu_sha · edited Nov 19 '09 at 14:38 by Peter Mortensen
