
I have a dataframe with the following string format.

data.description[4000]=['Conduit, PVC Utility Type DB 60 TC-6, 1-1/2"                                   LF   .050   $.86   $1.90   $2.76']

The string varies in length, but I would like to break it up by splitting at the ' LF ' substring. The desired output would be:

data2=['Conduit, PVC Utility Type DB 60 TC-6, 1-1/2"','LF',.050,'$.86','$1.90','$2.76']

If I were to have a list of units

units=['CLF','LF','EA']

How could I search each string in the dataframe and break it up into the format above? Splitting on the unit as a delimiter would kinda work, but I would lose the units. It gives me two strings that can be split further, but that seems to require a row-by-row function.

Is there a better way to do this?

JAB
user3757265
  • You can check this question and its answer to see how to apply a function row by row: http://stackoverflow.com/questions/25292838/applying-regex-to-a-pandas-dataframe Otherwise, in order to fully answer the question i.e. be able to have all desired fields, we need to know exactly what's in the data (not just one line). Could you for instance split on "two blanks or more" (typically no if there are missing values)... or do columns have the same position in the string in each row etc. – etna Feb 22 '15 at 08:49

1 Answer

You can use the string method split directly on the column with the text:

df['text'].str.split('(CLF|LF|EA)')

The capturing parentheses around the pattern keep the delimiter in the result.
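To see what the parentheses change, here is a minimal sketch using the standard library's `re.split`, which treats capturing groups the same way. It also builds the pattern from the `units` list in the question. The names `alternation`, `without_group` and `with_group` are just for illustration, and sorting longest-first is my own precaution so that 'CLF' is tried before 'LF'.

```python
import re

units = ['CLF', 'LF', 'EA']  # the unit list from the question

# Escape each unit and try longer ones first, so 'CLF' is matched before 'LF'.
alternation = '|'.join(re.escape(u) for u in sorted(units, key=len, reverse=True))

text = '123456 CLF 78910'

# Non-capturing group: the matched unit is discarded.
without_group = re.split(f'(?:{alternation})', text)

# Capturing group: the matched unit is kept as its own list item.
with_group = re.split(f'({alternation})', text)

print(without_group)
print(with_group)
```

The same pattern string, wrapped in parentheses, can then be passed to `str.split` on the column.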

Example:

import pandas as pd

units = '(CLF|LF|EA)'
df = pd.DataFrame({'text': ['aaaaaaa LF bbbbbbbb', '123456 CLF 78910', '!!!!!!!! EA @@@@@@@@@@']})
df.text.str.split(units)

returns:

0       [aaaaaaa , LF,  bbbbbbbb]
1          [123456 , CLF,  78910]
2    [!!!!!!!! , EA,  @@@@@@@@@@]
Name: text, dtype: object
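For the row in the question, a possible follow-up is to go straight to columns instead of lists. The sketch below assumes the unit is always surrounded by whitespace and that the fields after it are separated by whitespace; the group names `description`, `unit` and `rest` are made up for the example. Requiring whitespace around the unit also avoids matching 'LF' or 'EA' inside a longer word in the description.

```python
import pandas as pd

df = pd.DataFrame({'text': [
    'Conduit, PVC Utility Type DB 60 TC-6, 1-1/2"                                   LF   .050   $.86   $1.90   $2.76',
]})

# Description (lazy), then a unit surrounded by whitespace, then the rest of the line.
pattern = r'^(?P<description>.*?)\s+(?P<unit>CLF|LF|EA)\s+(?P<rest>.*)$'
parts = df['text'].str.extract(pattern)

# Split the trailing fields on runs of whitespace into separate columns.
numbers = parts['rest'].str.split(expand=True)

result = pd.concat([parts[['description', 'unit']], numbers], axis=1)
print(result.iloc[0].tolist())
```

Everything comes back as strings, so a column such as the `.050` value would still need something like `pd.to_numeric` if you want it as a float. Rows that do not contain any of the units come back as NaN from `str.extract`.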
JAB