I am new to regex and want to extract specific words from a Python string. This is the string:

'1. feature name: occupation_Transport-moving<br>coefficient: 0.1776<br>2. feature name: education<br>coefficient: 0.0726<br>3. feature name: occupation_Machine-op-inspct<br>coefficient: 0.0661<br>4. feature name: occupation_Armed-Forces<br>coefficient: 0.0006<br>5. feature name: workclass_Without-pay<br>coefficient: -0.0194<br>6. feature name: occupation_Handlers-cleaners<br>coefficient: -0.1256<br>7. feature name: occupation_Farming-fishing<br>coefficient: -0.3938<br>8. feature name: GDP Group<br>coefficient: -0.4138<br>9. feature name: occupation_Other-service<br>coefficient: -0.4294<br>10. feature name: occupation_Priv-house-serv<br>coefficient: -0.6560<br>'

The result I am looking for:

[occupation_Transport-moving,education,occupation_Machine-op-inspct,occupation_Armed-Forces,workclass_Without-pay,occupation_Handlers-cleaners,occupation_Farming-fishing,GDP Group,occupation_Other-service,occupation_Priv-house-serv]

I have tried this, but it returns the whole string starting from the first ':': re.findall(':\s(.*)<',txt)

Thank you in advance for your assistance.

stern15

1 Answer

Use

:\s*([^:.<]+)<

EXPLANATION

--------------------------------------------------------------------------------
  :                        ':'
--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    [^:.<]+                  any character except: ':', '.', '<' (1
                             or more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  <                        '<'
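
For illustration, here is a minimal sketch of how the pattern above might be applied with re.findall. The string is a shortened version of the one in the question, and the variable names pattern and names are only placeholders chosen for this example.

```python
import re

# Shortened version of the string from the question (three entries only).
txt = (
    "1. feature name: occupation_Transport-moving<br>coefficient: 0.1776<br>"
    "2. feature name: education<br>coefficient: 0.0726<br>"
    "8. feature name: GDP Group<br>coefficient: -0.4138<br>"
)

# Raw string so that \s reaches the regex engine unchanged.
pattern = re.compile(r":\s*([^:.<]+)<")

# With exactly one capturing group, findall returns the captured text
# of each match instead of the whole match.
names = pattern.findall(txt)
print(names)  # ['occupation_Transport-moving', 'education', 'GDP Group']
```

The coefficient values are skipped because they contain a '.', which the character class excludes. This relies on the format shown in the question: a coefficient written without a decimal point (for example "coefficient: 2<br>") would be captured as well.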
Ryszard Czech