Extract names or split text into columns

Question

update: I have been asked to provide as csv instead of a picture. Here is a google sheet with 40 rows of the data. The actual csv is 20k-40k rows. https://docs.google.com/spreadsheets/d/1HsqF_oqqoBoN2nwSH_7SQI_5w5D-THRI-CQslNW6LkY/edit?usp=sharing

I have rows of data that I need to split into columns. I know all the simple ways of doing this with pandas but can't use a simple solution like just splitting on a "," or " ". I have attached a picture below to show an example of my data. This is likely quite simple for someone with more skill than me. The reason I can't just split on a space and work with it that way is some players have "jr." at the end of their name or there is a guy named "chi chi" and the space in his first name can ruin everything. My ideal output would be extracting the name after each position (P,C,1B,2B,3B,SS,OF). So if the row was P John Smith P Dave Smith C Ryan Smith it would output "John Smith" in one column and "Dave Smith" in the next column and so on...

There are (2) "P" and (3) OF in each row, everything else is (1).

Data example:

I advise that you anonymize your data if you have not done so in your example, full names are visible — Adrien, Jul 27 '21 at 12:57
Welcome to stackoverflow, please read [tour] and [mre] and in this case also: [how-to-make-good-reproducible-pandas-examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) — Andreas, Jul 27 '21 at 12:59
On top of that, please post a dummy csv in plain text rather than as an image. — RJ Adriaansen, Jul 27 '21 at 12:59
These are professional athletes that why I did not anonymize the data, Apologies for my poor post, I've added a csv and will be better in the future. — jf44, Jul 27 '21 at 14:15

Alexey · Accepted Answer · 2021-07-27T13:54:26.227

0

I advise you to use df.str.split by regular expression. Then str.replace by regex your delimiters (P|C|1B|2B|3B|SS|OF) with nan and remove columns with nan.

Try this:

 import pandas as pd 
 import numpy as np

 df=df['Lineup'].str.split(r'(P|C|1B|2B|3B|SS|OF)', expand=True).astype('str')

 df=df.replace(r'(P|C|1B|2B|3B|SS|OF)',np.nan,regex=True).dropna(axis=1)

edited Jul 27 '21 at 13:54

answered Jul 27 '21 at 13:05

Alexey

386
2
9

Would this handle the fact that there are 2 "P" and 3 "OF" in each row? – jf44 Jul 27 '21 at 14:27
Certainly, check it out – Alexey Jul 27 '21 at 14:29
You can complete my code in one line as well for short – Alexey Jul 27 '21 at 14:33

hpchavaz · Answer 2 · 2021-07-27T17:47:30.220

May be you could try something like re.split(r"\s*(P|C|1B|2B|3B|SS|OF)\s*", data)[2::2]

data = "P rzeazetrze C tzer zer tzer 1B zer t zer 2B zertzetz 3B erztzer SS zertzer OF tertzertze erztzert"
names = re.split(r"\s*(P|C|1B|2B|3B|SS|OF)\s*", data)[2::2]
print(names)

=>

['rzeazetrze',
 'tzer zer tzer',
 'zer t zer',
 'zertzetz',
 'erztzer',
 'zertzer',
 'tertzertze erztzert']

But it could be fancier:

myiter = iter(re.split(r"\s*(P|C|1B|2B|3B|SS|OF)\s*", data)[1:])
positions = {pos:name for pos, name in zip(myiter, myiter)}
print(positions)

=>

{'P': 'rzeazetrze',
 'C': 'tzer zer tzer',
 '1B': 'zer t zer',
 '2B': 'zertzetz',
 '3B': 'erztzer',
 'SS': 'zertzer',
 'OF': 'tertzertze erztzert'}

Completed in accordance to the comment of jf44

If there are is the possibility that a "key" has multiple values

data = "P rzeazetrze P AAAAAAAAaaa C tzer zer tzer 1B zer t zer 2B zertzetz 3B erztzer SS zertzer OF tertzertze erztzert OF razetaz OF rzea trze tez"

You could get for example get a list of tuples :

myiter = iter(re.split(r"\s*(P|C|1B|2B|3B|SS|OF)\s*", data)[1:])
positions = [(pos, name) for pos, name in zip(myiter, myiter)]
print(positions)

=>

[('P', 'rzeazetrze'),
 ('P', 'AAAAAAAAaaa'),
 ('C', 'tzer zer tzer'),
 ('1B', 'zer t zer'),
 ('2B', 'zertzetz'),
 ('3B', 'erztzer'),
 ('SS', 'zertzer'),
 ('OF', 'tertzertze erztzert'),
 ('OF', 'razetaz'),
 ('OF', 'rzea trze tez')]

Or fancier a dict of lists

from collections import defaultdict
myiter = iter(re.split(r"\s*(P|C|1B|2B|3B|SS|OF)\s*", data)[1:])
positions = defaultdict(list)
for pos, name in zip(myiter, myiter):
    positions[pos].append(name)
print(positions)

=>

defaultdict(list,
            {'P': ['rzeazetrze', 'AAAAAAAAaaa'],
             'C': ['tzer zer tzer'],
             '1B': ['zer t zer'],
             '2B': ['zertzetz'],
             '3B': ['erztzer'],
             'SS': ['zertzer'],
             'OF': ['tertzertze erztzert', 'razetaz', 'rzea trze tez']})

then

print(positions['P'])
print(positions['C'])

=>

['rzeazetrze', 'AAAAAAAAaaa']
['tzer zer tzer']

Would this handle the fact that there are 2 "P" and 3 "OF" in each row? — jf44, Jul 27 '21 at 14:27
I voted for the other answer because it solved my current need simply. However, I believe your answer has more long term value. Not only do I think I’ll need it later in this project but it leveled up my ability to use python and I’m very thankful for that. — jf44, Jul 30 '21 at 13:35

Extract names or split text into columns

2 Answers2