I have a Python list which is derived from a pandas Series as follows:

dsa = pd.Series(crew_data['Work Type'])
disc = [dsa]
print(disc)

The output is as follows:

[0      Disc - Standard Removal & Herbicide 
 1      Disc - Standard Removal & Herbicide  
 2                            Standard Trim  
 3                       Disc - Hazard Tree  
 4                       Disc - Hazard Tree  
                  ...                   
 134                     Disc - Hazard Tree  
 135                     Disc - Hazard Tree  
 136                     Disc - Hazard Tree  
 137                     Disc - Hazard Tree  
 138                     Disc - Hazard Tree  
 Name: Work Type, Length: 139, dtype: object]

The next step is to slice the first 4 characters of each element, so that the value returned is Disc.

This is simple for a single string, but doing it for every element of a list has proved surprisingly difficult. In Excel it takes only the formula =LEFT(A1,4), so surely it can be done just as simply in Python?

If anyone has a solution that would be great.

hpaulj
jasw
  • Is this list one big string, or are there multiple objects in the list? Could you provide a better example? – PacketLoss Jan 29 '20 at 00:51
  • No these are individual objects. They represent a category code for each individual task in the system database – jasw Jan 29 '20 at 00:54
  • Is there a reason you call `pd.Series()` on `crew_data['column']`? Typically, if `crew_data` is a `DataFrame`, getting a single column will already give you a `Series`. – Grismar Jan 29 '20 at 00:54
  • Depending on some of the details that aren't clear in your question, your question may have already been answered here https://stackoverflow.com/questions/36505847/substring-of-an-entire-column-in-pandas-dataframe – Grismar Jan 29 '20 at 00:55
  • Thanks for the link. That worked perfectly. Everything that I searched on this topic provided a function with a for loop or something far more convoluted that didn't work... – jasw Jan 29 '20 at 01:03
  • Does this answer your question? [substring of an entire column in pandas dataframe](https://stackoverflow.com/questions/36505847/substring-of-an-entire-column-in-pandas-dataframe) – AMC Jan 29 '20 at 05:38

2 Answers

With a sample dataframe

In [138]: df                                                                                     
Out[138]: 
  col1  col2 col3 newcol
0    a     1    x    Wow
1    b     2    y    Dud
2    c     1    z    Wow
In [139]: df['newcol']                                                                           
Out[139]: 
0    Wow
1    Dud
2    Wow
Name: newcol, dtype: object
In [140]: type(_)                                                                                
Out[140]: pandas.core.series.Series

Selecting a column already gives me a Series, so there is no need for another Series wrapper:

In [141]: pd.Series(df['newcol'])                                                                
Out[141]: 
0    Wow
1    Dud
2    Wow
Name: newcol, dtype: object

We can put it in a list, but that only produces a one-element list holding the whole Series:

In [142]: [pd.Series(df['newcol'])]                                                              
Out[142]: 
[0    Wow
 1    Dud
 2    Wow
 Name: newcol, dtype: object]
In [143]: len(_)                                                                                 
Out[143]: 1
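
To get a Python list with one element per row, rather than a one-element list wrapping the Series, `Series.tolist()` (or `list(series)`) can be used. A minimal sketch, rebuilding only the `newcol` values from the sample frame above; the variable names `wrapped` and `unpacked` are illustrative:

```python
import pandas as pd

# Rebuild just the 'newcol' column from the sample frame above.
ds = pd.Series(["Wow", "Dud", "Wow"], name="newcol")

wrapped = [ds]           # a list containing one Series object
unpacked = ds.tolist()   # a list containing one string per row

print(len(wrapped))      # 1
print(len(unpacked))     # 3
print(unpacked)          # ['Wow', 'Dud', 'Wow']
```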

We can extract the values as a numpy array:

In [144]: pd.Series(df['newcol']).values                                                         
Out[144]: array(['Wow', 'Dud', 'Wow'], dtype=object)

We can apply string slicing to each element of either the array or the Series with a list comprehension:

In [145]: [astr[:2] for astr in _144]                                                            
Out[145]: ['Wo', 'Du', 'Wo']
In [146]: [astr[:2] for astr in _141]                                                            
Out[146]: ['Wo', 'Du', 'Wo']

The list comprehension isn't necessarily the most 'advanced' way, but it's a good start. Actually it is close to the best, since slicing a string has to use Python's own string methods; neither numpy nor pandas provides a separate compiled implementation of string slicing for object-dtype data.

pandas has a str accessor for applying string methods to every element of a Series:
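
One caveat with the plain comprehension is that it raises a `TypeError` if the column contains missing values, because `NaN` is a float and cannot be sliced. A minimal sketch of a guarded version, assuming hypothetical values modelled on the `Work Type` column in the question:

```python
import numpy as np
import pandas as pd

# Hypothetical values modelled on the 'Work Type' column, with one missing entry.
work_type = pd.Series(
    ["Disc - Standard Removal & Herbicide", "Standard Trim", np.nan, "Disc - Hazard Tree"]
)

# Slice only real strings; pass anything else (here the missing value) through unchanged.
first4 = [s[:4] if isinstance(s, str) else s for s in work_type]
print(first4)
```

The three strings come back as 'Disc', 'Stan' and 'Disc', and the missing entry is left as it was.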

In [147]: ds = df['newcol']  
In [151]: ds.str.slice(0,2)        # or ds.str[:2]                                                               
Out[151]: 
0    Wo
1    Du
2    Wo
Name: newcol, dtype: object

This is cleaner and prettier than the list comprehensions, but actually slower.
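
The speed difference can be checked with `timeit`. The sketch below uses hypothetical data of the same length as the series in the question (139 rows). The actual numbers depend on the machine, the pandas version and the series length, so no timings are quoted here:

```python
import timeit

import pandas as pd

# Hypothetical data: 139 rows, like the series in the question.
values = ["Disc - Hazard Tree", "Standard Trim", "Disc - Standard Removal & Herbicide"] * 46
values.append("Disc - Hazard Tree")
ds = pd.Series(values)


def with_comprehension():
    """Slice each string in a plain Python list comprehension."""
    return [s[:4] for s in ds]


def with_str_accessor():
    """Slice each string through the pandas .str accessor."""
    return ds.str[:4]


# Both approaches should produce the same values.
assert with_comprehension() == with_str_accessor().tolist()

# Time 200 repetitions of each approach.
print("comprehension:", timeit.timeit(with_comprehension, number=200))
print(".str accessor:", timeit.timeit(with_str_accessor, number=200))
```

Even where `.str` is slower, it keeps the index and passes missing values through, which may matter more than raw speed when the result goes back into a DataFrame.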

hpaulj

I might be missing the gist of the question, but here's a regular expression implementation.

import re

# Sample data
disc = ['                       Disc - Standard Removal & Herbicide ',
 '      Disc - Standard Removal & Herbicide  ',
'                           Standard Trim  ',
'                       Disc - Hazard Tree',
'                      Disc - Hazard Tree ',]

# Regular expression pattern
# "Disc" is in parentheses because that is the part we want to capture.
# re.match(<pattern>, <string>).group(1) returns the first captured group; .group() with
# no argument would return the whole match, including the leading whitespace.
# \s* (rather than \s+?) also accepts values that have no leading whitespace.
disc_pattern = r"\s*(Disc)\b"

# List comprehension that skips rows that do not start with 'Disc'
[m.group(1) for m in (re.match(disc_pattern, i) for i in disc) if m]

Output:

['Disc', 'Disc', 'Disc', 'Disc']
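
If the data is still in a pandas Series, the same capture can be done without leaving pandas by using `Series.str.extract`. A minimal sketch, assuming hypothetical values modelled on the `Work Type` column in the question; the pattern is an illustrative variant that also tolerates values with no leading whitespace:

```python
import pandas as pd

# Hypothetical values modelled on the 'Work Type' column in the question.
work_type = pd.Series(
    ["Disc - Standard Removal & Herbicide", "Standard Trim", "Disc - Hazard Tree"],
    name="Work Type",
)

# expand=False returns a Series for a single capture group;
# rows where the pattern does not match become missing values.
prefix = work_type.str.extract(r"^\s*(Disc)\b", expand=False)

print(prefix.dropna().tolist())   # ['Disc', 'Disc']
```

Unlike the list comprehension, this keeps one result per row, so the non-matching rows can be inspected or filtered afterwards.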
Mark Moretto