2

I have a (massive) list represented as a string (the one below is just a small example):

"['A', 'B', 'C']"

and I need to make it a list type:

['A', 'B', 'C']

but if I do:

list("['A', 'B', 'C']")

obviously I'll get:

['[', "'", 'A', "'", ',', ' ', "'", 'B', "'", ',', ' ', "'", 'C', "'", ']']

Currently I'm using:

ast.literal_eval("['A', 'B', 'C']")

Except that the lists which my program is handling are huge, and the strings are millions of bytes (the test string is over 4 million characters). So my ast.literal_eval() raises a MemoryError whenever I try to run it.

What I need therefore is a way (it doesn't have to be pythonic, elegant or even particularly efficient) to make these huge strings into lists without raising a MemoryError.

timrau
Arcayn

6 Answers

3

The input data format is not exactly standard and it's not convenient to parse, especially now that it has become huge. Depending on where the data is coming from, you should either start keeping it in a real database, or think about ways to make it JSON parseable. For instance, if we replace single quotes with double quotes in your current sample input, we can parse it with json:

>>> import json
>>> s = "['A', 'B', 'C']"
>>> json.loads(s.replace("'", '"'))
[u'A', u'B', u'C']

Then, once the data is JSON, it is a different and more common problem. You can use one of the incremental parsers, like ijson, or an event-driven yajl, to avoid memory errors.
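As a rough illustration of the incremental idea using only the standard library (ijson and yajl are third-party and are not shown here), json.JSONDecoder.raw_decode can decode one array element at a time instead of building the whole list in one call. The helper name iter_json_array is hypothetical, and the sketch assumes the whole string still fits in memory and that replacing the quotes is safe for your data (note that str.replace makes a full copy of the string):

```python
import json


def iter_json_array(text):
    """Yield the elements of a JSON array one at a time (hypothetical helper)."""
    decoder = json.JSONDecoder()
    pos = text.index("[") + 1  # start just after the opening bracket
    while True:
        # Skip whitespace and the commas that separate elements.
        while pos < len(text) and text[pos] in " \t\r\n,":
            pos += 1
        if pos >= len(text) or text[pos] == "]":
            return
        # raw_decode returns the decoded value and the index where it ended.
        value, pos = decoder.raw_decode(text, pos)
        yield value


s = "['A', 'B', 'C']"
print(list(iter_json_array(s.replace("'", '"'))))  # ['A', 'B', 'C']
```

Because this is a generator, the caller can process each element and discard it rather than holding the full list.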

alecxe
3

You could try lazy parsing based on the iterator interface and the itertools module.

For example, using itertools.takewhile:

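import itertools


def lazy_to_list(input_string):
    iterable = iter(input_string)
    next(iterable)  # skip the opening [
    l = []
    while True:
        # consume characters up to the next comma (the comma itself is discarded)
        value = ''.join(itertools.takewhile(lambda c: c != ',', iterable))
        if not value:
            break  # iterator exhausted
        if value.endswith("]"):
            value = value.rstrip("]")  # last element: drop the closing ]
        if value.strip():  # guards against the empty list "[]"
            l.append(eval(value))  # eval is only acceptable for trusted input
    return l


N = 1000000
s = repr(list(range(N)))
assert lazy_to_list(s) == list(range(N))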

A further improvement would be to read the huge string lazily from a file (since all processing is done lazily). Obviously, it will break if an element's representation contains a comma (and probably for many other reasons).
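A minimal sketch of that file-based variant, written as a generator so the full list never has to exist in memory. The name iter_list_items and the chunk_size parameter are hypothetical, it swaps eval for ast.literal_eval on each small piece, and it keeps the same limitation: no element may contain a comma.

```python
import ast
import io


def iter_list_items(stream, chunk_size=65536):
    """Yield items from a "[a, b, c]"-style text stream, one at a time.

    Assumes a flat list whose items contain no commas.
    """
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Everything before the last comma is complete; keep the rest buffered.
        *complete, buffer = buffer.split(",")
        for piece in complete:
            piece = piece.strip().lstrip("[")
            if piece:
                yield ast.literal_eval(piece)
    # Whatever is left is the final item, followed by the closing bracket.
    tail = buffer.strip().lstrip("[").rstrip("]").strip()
    if tail:
        yield ast.literal_eval(tail)


stream = io.StringIO("['A', 'B', 'C']")  # stands in for an open text file
print(list(iter_list_items(stream, chunk_size=4)))  # ['A', 'B', 'C']
```

With a real file, passing the object returned by open() in text mode should work the same way, since only read(chunk_size) is used.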

Anyway, it still feels like a solution to a badly defined problem. Depending on the type of the underlying data and external requirements (e.g. should the file be readable by a person, not only a machine), you'll be better off with a standard serialization format (e.g. JSON, XML, pickle, etc.).

Łukasz Rogalski
  • Why return a list? Wouldn't it be better to make it a lazy function too? After all, it probably gives a memory error because the resulting list is too big. – Copperfield Jan 01 '16 at 16:00
0

Ok sorry to waste your time guys, I found a really un-pythonic but effective solution after trying everything else:

str.split("', '")

And removing the brackets at both ends. This works because, given how the data was produced, the separator "', '" never occurs inside any of the elements. There we go.
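Spelled out, that approach might look like the sketch below. The function name split_quoted_list is hypothetical, and it assumes every element is a single-quoted string with no escapes, separated by exactly "', '":

```python
def split_quoted_list(text):
    """Split "['A', 'B', 'C']" into ['A', 'B', 'C'] without parsing it."""
    text = text.strip()
    if text in ("", "[]"):
        return []
    # Drop the leading "['" and the trailing "']", then split on the separator.
    inner = text[2:-2]
    return inner.split("', '")


print(split_quoted_list("['A', 'B', 'C']"))  # ['A', 'B', 'C']
```

The slice does copy the string once, but no syntax tree is built, which is likely why this stays within memory where ast.literal_eval did not.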

Arcayn
0

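You could use the YAML library (pip install pyyaml). Use yaml.safe_load rather than yaml.load: recent PyYAML versions require an explicit Loader argument for yaml.load, and safe_load only constructs plain Python objects.

>>> import yaml
>>> yaml.safe_load("['A', 'B', 'C']")
['A', 'B', 'C']

If you are reading from a file you can also do this:

>>> with open(myfile) as fid:
...     data = yaml.safe_load(fid)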
Brad Campbell
-2

You may have better luck using Python's built-in eval() function.

eval("['A', 'B', 'C']")

returns a list object

['A', 'B', 'C']
Fred Truter
  • I don't think so...it's not about evaluating the string...but to evaluate a very long string (millions of bytes) which will cause memory error... – Iron Fist Jan 01 '16 at 14:46
  • Yeah, and eval() is dangerous, which is why I'm using ast in the first place – Arcayn Jan 01 '16 at 14:58
  • @Arcayn gave a perfectly formatted string representation of a Python list as an example, and did not specify that the string was from an untrusted source, therefore using `eval` is **not unsafe**. Also it is stated that "I have a ... string" implying the string was already in memory. So we should avoid creating copies of it, or parts of it. But we *have* to create a list in memory as requested, which *will* require more memory to be allocated. As others have noted, it would be better to parse the string from a file-like object and never load it into memory in the first place. – Fred Truter Jan 02 '16 at 12:29
-2
>>> import ast
>>> input = "['A', 'B', 'C']"
>>> list = ast.literal_eval(input)
>>> output = [i.strip() for i in list]
>>> type(output)
<class 'list'>
>>> output
['A', 'B', 'C']
  • This masks _two_ built-in functions, then uses the _exact method that the OP stated does not work_. Did you read anything more than the title? – TigerhawkT3 Jan 02 '16 at 01:22