The first line of my input file looks like this:

<doc id="12" url="http://en.wikipedia.org/wiki?curid=12" title="Anarchism">

I want to store the attributes as key-value pairs in Python, like this:

{doc_id: 12, url: http://en.wikipedia.org/wiki?curid=12, title: Anarchism} 

Here is my code:

with open('wiki_00') as f:
    first_line = f.readline().rstrip()
first_line.split()[1:]

The output looks like this:

['id="12"',
'url="http://en.wikipedia.org/wiki?curid=12"',
'title="Anarchism">']

But I would like the quotes and angle brackets removed, and the id stored as an int.

Technologic27
  • Why does the tag name only get attached to the `id` and no other attribute? – TigerhawkT3 Jan 19 '17 at 02:43
  • Do you always want to prefix the `id` attribute with the tag name? – pushkin Jan 19 '17 at 02:51
  • @pushkin ok not necessary. it can look like this id:12 – Technologic27 Jan 19 '17 at 02:53
  • Do you need to remember the tag name at all, or is it irrelevant? – pushkin Jan 19 '17 at 02:56
  • 2
    It looks like you want us to write some code for you. While many users are willing to produce code for a coder in distress, they usually only help when the poster has already tried to solve the problem on their own. A good way to demonstrate this effort is to include the code you've written so far, example input (if there is any), the expected output, and the output you actually get (output, tracebacks, etc.). The more detail you provide, the more answers you are likely to receive. Check the [FAQ](http://stackoverflow.com/tour) and [How to Ask](http://stackoverflow.com/questions/how-to-ask). – TigerhawkT3 Jan 19 '17 at 02:59
  • @TigerhawkT3 Hi i wrote a bit of code, please look at the edits – Technologic27 Jan 19 '17 at 03:07
  • Possible duplicate of [How to list all files of a directory in Python](http://stackoverflow.com/questions/3207219/how-to-list-all-files-of-a-directory-in-python) – Technologic27 Jan 19 '17 at 08:03

1 Answer

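Don't use [1:] to strip away the brackets. Use the strip method: line.strip(' <>') removes space, <, and > characters from both ends of the line. It does not remove other whitespace such as a trailing newline, and it does not touch characters in the middle of the line.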
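A quick check of what strip does with a character set, as a small sketch. The sample line is shortened from the one in the question, and the trailing newline is an assumption about how the line arrives when read from a file:

```python
line = '<doc id="12" title="Anarchism">\n'

# Only the listed characters are stripped, so the trailing newline
# stops strip() before it reaches the closing '>'.
only_listed = line.strip(' <>')

# Stripping whitespace first lets the brackets be removed as well.
cleaned = line.strip().strip('<>')

print(repr(only_listed))
print(repr(cleaned))
```

In the question the line is already passed through rstrip(), so the newline is gone by the time strip(' <>') would run.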

Something like this will do what I think you want. You may want to add error handling.

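def turn_line_into_dict(line):
    # remove surrounding whitespace (including the newline), then the brackets
    line = line.strip().strip('<>')
    # drop the tag name, i.e. everything up to the first space
    first_space_idx = line.find(' ')
    line_without_tag = line[first_space_idx+1:]

    # note: this breaks if an attribute value itself contains a space
    attr_list = line_without_tag.split(' ')

    d = {}
    for attr_str in attr_list:
        key, value = attr_str.split('=', 1)  # split on the first '=' only, so an '=' in the url doesn't screw this up
        d[key] = value.strip('"\'')  # remove quotes; the value is still a string

    # the dict does not convert types, so turn the id into an int explicitly
    if 'id' in d:
        d['id'] = int(d['id'])

    return d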
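
Splitting on spaces misparses any attribute value that contains a space (a title such as "Albert Einstein", for example). As an alternative, here is a minimal sketch using a regular expression. The names ATTR_RE and parse_doc_tag are made up for illustration, and it assumes every value is wrapped in double quotes and contains no double quote itself:

```python
import re

# Matches name="value" pairs; the value may contain spaces but not a double quote.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_doc_tag(line):
    """Return the attributes of a <doc ...> line as a dict, with id as an int."""
    attrs = dict(ATTR_RE.findall(line))
    if 'id' in attrs:
        attrs['id'] = int(attrs['id'])
    return attrs


line = '<doc id="12" url="http://en.wikipedia.org/wiki?curid=12" title="Anarchism">'
print(parse_doc_tag(line))
```

Because the pattern only looks for quoted pairs, the tag name and the angle brackets never need to be stripped separately. For anything beyond simple one-line tags like this, a real XML or HTML parser is probably the safer choice.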
pushkin
  • `line.strip(' <>')` removes the space, `<`, and `>` characters from the ends of the line. It doesn't remove all whitespace, and it doesn't remove those characters if they're between other characters. – TigerhawkT3 Jan 19 '17 at 03:24
  • @TigerhawkT3 I chose to not worry about the details. I presented the general idea. OP can improve on it. However, why would I need to worry about `>` and `<` within the line? Second, the only issue I can imagine with not removing all spaces is that the tag name may be preceded by spaces so `line_without_tag` may be incorrect, but again, OP can deal with that if it's an issue. – pushkin Jan 19 '17 at 03:42
  • It's not about worrying about what's in the line, it's that you presented two incorrect facts. – TigerhawkT3 Jan 19 '17 at 03:43
  • @TigerhawkT3 Oh I see. Yeah I could have worded that better. – pushkin Jan 19 '17 at 13:59