I have to parse a file that contains a lot of links. Here is an example of how it looks:

  <hm><w syst="whatrudoing" please="http://facebook.com.u/qwe-     
  pls/facebook?funn=wordlis&sys;sys;colorsdif_id=11908675">colors</p></hm>

 <hm><w syst="whatrudoing" please="http://facebook.com.u/qwe-
  pls/facebook?funn=wordlis&sys;sys;colorsdif_id=45103481">yelloW</p></hm>

  <td>I have a dream, and it is all good 2</hm>

 <hm><w syst="whatrudoing" please="http://facebook.com.u/qwe-    
  pls/facebook?funn=wordlis&sys;sys;colorsdif_id=40984930">orangE</p></hm>

 <hm><w syst="whatrudoing" please="http://facebook.com.u/qwe-
  pls/facebook?funn=wordlis&sys;sys;colorsdif_id=90648361">pinK</p></hm>

I only need to keep the words that appear in the same position as >colors<, so I also want >yelloW<, >orangE< and >pinK<.

In this example, the part the links have in common is the whole link, except for the id number (which is different in every link) and the word.

After finding all the words, I want to save them in a dictionary that uses the first word as the key and the others as its values, so the final result would be:

   d = {"colors": ["yelloW", "orangE", "pinK"]}

1 Answer

You can try something like this:

   import re

   # ree is assumed to be a string holding the contents of the file
   re.findall(r"http://[^>]+>(\w+)", ree)

Where:

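  • [^>]+ - matches one or more characters other than > (newlines included, so a link that wraps onto the next line still matches)
  • \w+ - matches one or more word characters (letters, digits and underscore)
  • (..) - a capturing group; findall returns only the text matched inside the parentheses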
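
As a quick check, here is a minimal sketch that runs the same pattern on part of the sample from the question. The variable name sample is only for illustration; in practice the string would come from reading your file.

```python
import re

# Sample text copied from the question; each link wraps onto a second line.
sample = '''<hm><w syst="whatrudoing" please="http://facebook.com.u/qwe-
  pls/facebook?funn=wordlis&sys;sys;colorsdif_id=11908675">colors</p></hm>

<td>I have a dream, and it is all good 2</hm>

<hm><w syst="whatrudoing" please="http://facebook.com.u/qwe-
  pls/facebook?funn=wordlis&sys;sys;colorsdif_id=45103481">yelloW</p></hm>
'''

# [^>]+ also consumes the newline inside the link, so no re.DOTALL flag is needed.
words = re.findall(r"http://[^>]+>(\w+)", sample)
print(words)  # ['colors', 'yelloW']
```

The <td> line is skipped because it contains no http:// link.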

Note that Python dictionaries do not support duplicate keys. You can look at this question.
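
Since each key can appear only once, one way to get the result you describe is to keep a single key and put all the other words in a list under it. A minimal sketch, assuming the list of words has already been extracted (the list below is hard-coded for illustration, and words_to_dict is a made-up helper name):

```python
# Hypothetical input: the words extracted from the sample in the question.
words = ["colors", "yelloW", "orangE", "pinK"]


def words_to_dict(words):
    """Use the first word as the key and the remaining words as its list of values."""
    if not words:
        # Nothing was found, so there is no key to use.
        return {}
    key, *values = words
    return {key: values}


d = words_to_dict(words)
print(d)  # {'colors': ['yelloW', 'orangE', 'pinK']}
```

If the file might contain no links at all, the empty-list check avoids an error when unpacking.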

Aska