I have a file that is made of lines of the following format:-

[123, something, some other thing, "text that i want", more details]

eg:-

[1393349463, u'Tue Feb 25 17:31:03 +0000 2014', 438365537261735936, u'A Falcon character poster for Captain America: The Winter Soldier has swooped in', [], [u'totalfilm'], [u'//1bJdCJ2'], [u'http://pbs.twimg.com/media/BhViUNICQAAoBue.jpg'], 369, 362]

Now I want to read each line directly into Python as a list, rather than reading it as a string, splitting it on ',' and joining the pieces back together, because the text field can itself contain a ',' and I don't want to split on that.

I am looking for something like this:

with open("input.txt") as fp:
   for line in fp:
       corpus.append(line[3]) #read only text
  • Did you create this file in the first place? If so, the right thing to do is to fix the way you create this file: use some format that's meant to be stored as text and then parsed back in, like JSON. – abarnert Apr 29 '18 at 05:12
  • No, I didn't, and I don't have control over this input. – Divyang Vashi Apr 29 '18 at 05:13
  • If this is part of a homework assignment, or a job, you should at least make sure your teacher or boss or whoever recognizes that they've given you a task that should have been trivial, but is instead painful and brittle, purely because someone else is using Python `repr` as a persistence format, which is a terrible idea and easy to fix at the source. – abarnert Apr 29 '18 at 05:18
  • @Mulliganaceous That question's answer won't work, unless you can guarantee that none of the strings have any commas, or backslash-escaped backslashes or quotes or special characters. (Plus, you'd still need a way to remove the `u'…'` around each string.) – abarnert Apr 29 '18 at 05:27
  • @information_interchange No, that's not the same problem as this. The OP here knows how to get each line, and knows how to append something to a list for each line, so telling him how to get a list of each line won't help him. He needs to know how to _parse_ each of those lines. – abarnert Apr 29 '18 at 05:28

1 Answer

Your input was obviously generated by just printing out Python lists (or calling str or repr on them).

This particular example can be handled by using literal_eval:

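import ast

corpus = []
with open("input.txt") as fp:
    for line in fp:
        # Parse the line as a Python literal (here, a list)
        obj = ast.literal_eval(line)
        corpus.append(obj[3])  # keep only the text field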
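To see why this avoids the comma problem, here is a minimal sketch on a single made-up line (the sample data is hypothetical, shaped like the question's input but with a comma inside the text field). A naive split cuts the text in two, while literal_eval keeps it as one string:

```python
import ast

# Hypothetical line shaped like the question's data, with a comma
# inside the text field at index 3.
line = "[123, u'Tue Feb 25', 456, u'Hello, world', [], [u'tag'], 1, 2]\n"

# Naive approach: splitting on ',' also splits inside the quoted text.
naive = line.split(",")

# literal_eval parses the whole line as a Python list literal, so the
# quoted text stays intact and the u'...' prefix is handled for free.
parsed = ast.literal_eval(line.strip())

text = parsed[3]
print(len(naive), len(parsed), text)
```

The split produces one piece more than the list has elements, which is exactly the bug the question wants to avoid.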

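However, that won't work for every list that Python can print. literal_eval only accepts literals (strings, numbers, tuples, lists, dicts, sets, booleans and None), so a list containing anything else, such as a datetime or a custom object, will raise an error. And when it doesn't work… well, there's not much you can do in general. But you can just literal_eval until you get an error, and then, for each error, laboriously work out how to pre-process things to work around it.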
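One way to apply that "parse until it fails" advice is to collect the failing lines instead of crashing on the first one, so you can inspect them afterwards. This is only a sketch: the helper name `extract_texts` and its `index` parameter are made up for illustration, and the sample lines are hypothetical.

```python
import ast

def extract_texts(lines, index=3):
    """Return (texts, failures) for an iterable of repr-style list lines.

    Hypothetical helper: `failures` holds (line number, raw line, exception)
    for every line that could not be parsed or indexed.
    """
    texts = []
    failures = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            obj = ast.literal_eval(line)
            texts.append(obj[index])
        except (ValueError, SyntaxError, IndexError, TypeError) as exc:
            # ValueError: something that is not a literal (e.g. a call)
            # SyntaxError: truncated or otherwise malformed line
            # IndexError/TypeError: parsed, but not a long enough list
            failures.append((lineno, line, exc))
    return texts, failures

sample = [
    "[1, 2, 3, u'a, b', 5]\n",
    "[1, 2, 3, datetime.date(2014, 2, 25)]\n",  # not a literal
    "\n",
    "[1, 2\n",                                   # truncated
]
texts, failures = extract_texts(sample)
```

With a real file you would pass the open file object, for example `extract_texts(open("input.txt"))`, and then look at `failures` to decide what pre-processing each bad line needs.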

The right thing to do is generate output that's actually parseable, like JSON, and then you can just parse it trivially.
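If you (or whoever produces the file) can change the writer, a rough sketch of that approach is one JSON document per line. The record below is made up for illustration, and the text sits at index 2 here only because the sample list is shorter than the real data:

```python
import json

# Hypothetical records, shaped loosely like the question's data.
records = [
    [1393349463, "Tue Feb 25 17:31:03 +0000 2014",
     "A poster, now with a comma", ["totalfilm"], 369],
]

# Writer side: one JSON array per line instead of str(list).
lines = [json.dumps(record) for record in records]

# Reader side: each line parses straight back into a list.
corpus = [json.loads(line)[2] for line in lines]
```

Commas, quotes and backslashes inside the text are escaped by `json.dumps` and restored by `json.loads`, so no pre-processing is needed.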

abarnert
  • what is ast here? – Divyang Vashi Apr 29 '18 at 05:21
  • @DivyangVashi It's the module that `literal_eval` is in. Click the link in the question to see its docs. The `ast` module is designed for parsing Python code. The reason `literal_eval` is buried there is that it uses the `ast` module to parse the input as Python source, and to discourage people from using `repr` and `literal_eval` as a persistence method, because that would be an attractive nuisance (it's a terrible idea, but it might _seem_ like a good idea at first…). – abarnert Apr 29 '18 at 05:25