Using `ast.literal_eval(...)` will work, but it requires special syntax that other CSV-reading software won't recognize, and it uses an eval-style call, which is a red flag. Using `eval` can be dangerous, even though in this case you're using the safer `literal_eval` option, which is more restrained than the raw `eval` function.
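To make the distinction concrete, here's a quick sketch: `literal_eval` only accepts Python literal structures and refuses anything executable, where a raw `eval` would run it:

```python
import ast

# literal_eval parses literal structures (strings, numbers, tuples,
# lists, dicts, sets, booleans, None) and nothing else.
cities = ast.literal_eval("['Toronto', 'Ottawa', 'Montreal']")
print(cities)  # ['Toronto', 'Ottawa', 'Montreal']

# Anything containing a function call is rejected with a ValueError,
# whereas a raw eval() would happily execute it.
try:
    ast.literal_eval("__import__('os').getcwd()")
except ValueError as exc:
    print("rejected:", exc)
```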
Usually what you'll see in CSV files that have many values in a single column is that they'll use a simple delimiter and quote the field.
For instance:

```
ID,Country,Cities
1,Canada,"Toronto;Ottawa;Montreal"
```
Then in Python, or any other language, it becomes trivial to read without having to resort to `eval`:
```python
import csv

with open("data.csv") as fobj:
    reader = csv.reader(fobj)
    field_names = next(reader)
    rows = []
    for row in reader:
        row[-1] = row[-1].split(";")
        rows.append(row)
```
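Producing such a file is just as simple going the other way. A sketch (the `rows` data here is made up, and `io.StringIO` stands in for a real file):

```python
import csv
import io

# Hypothetical data: the last column is a list of cities per row.
rows = [
    ["1", "Canada", ["Toronto", "Ottawa", "Montreal"]],
    ["2", "France", ["Paris", "Lyon"]],
]

buffer = io.StringIO()  # stand-in for open("data.csv", "w", newline="")
writer = csv.writer(buffer)
writer.writerow(["ID", "Country", "Cities"])
for row in rows:
    # Collapse the list back into a single ";"-delimited field.
    writer.writerow(row[:-1] + [";".join(row[-1])])

print(buffer.getvalue())
```

Note that with `,` as the column delimiter the `;`-joined field doesn't strictly need quotes to round-trip; `csv.writer` only adds them when a field contains the delimiter, a quote character, or a newline.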
Issues with `ast.literal_eval`
Even though the `ast.literal_eval` function is much safer than using a regular `eval` on user input, it still might be exploitable. The documentation for `literal_eval` has this warning:
> **Warning:** It is possible to crash the Python interpreter with a sufficiently large/complex string due to stack depth limitations in Python's AST compiler.
A demonstration of this:

```
>>> import ast
>>> ast.literal_eval("()" * 10 ** 6)
[1]    48513 segmentation fault  python
```
I'm definitely not an expert, but giving a user the ability to crash a program, and potentially exploit some obscure memory vulnerability, is bad, and in this use case it can be avoided.
If the reason you want to use `literal_eval` is to get proper typing, and you're positive that the input data is 100% trusted, then I suppose it's fine to use. But you could always wrap the function to perform some sanity checks:
```python
import ast

def sanely_eval(value: str, max_size: int = 100_000) -> object:
    if len(value) > max_size:
        raise ValueError(f"len(value) is greater than the max_size={max_size!r}")
    return ast.literal_eval(value)
```
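A quick check of the wrapper's behaviour (repeating the definition so the snippet runs on its own; the `100_000` threshold is arbitrary): the pathological payload from the demonstration above is now rejected by the length check before it ever reaches the AST compiler.

```python
import ast

def sanely_eval(value: str, max_size: int = 100_000) -> object:
    if len(value) > max_size:
        raise ValueError(f"len(value) is greater than the max_size={max_size!r}")
    return ast.literal_eval(value)

# Ordinary literals still evaluate as expected.
print(sanely_eval("['Toronto', 'Ottawa']"))  # ['Toronto', 'Ottawa']

# The segfault payload is now caught by the length check instead.
try:
    sanely_eval("()" * 10 ** 6)
except ValueError as exc:
    print("rejected:", exc)
```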
But, depending on how you're creating and using the CSV files, this may make the data less portable, since it's a Python-specific format.