3

I have a csv file of such structure:

Id,Country,Cities
1,Canada,"['Toronto','Ottawa','Montreal']"
2,Italy,"['Rome','Milan','Naples', 'Palermo']"
3,France,"['Paris','Cannes','Lyon']"
4,Spain,"['Seville','Alicante','Barcelona']"

The last column contains a list, but it is represented as a string so that it is treated as a single element. When parsing the file, I need to have this element as a list, not a string. So far I've found the way to convert it:

L = "['Toronto','Ottawa','Montreal']"
seq = ast.literal_eval(L)

Since I'm a newbie in python, my question is -- is this normal way of doing this, or there's a right way to represent lists in CSV so that I don't have to do conversions, or there's a simpler way to convert?

Thanks!

kederrac
  • 16,819
  • 6
  • 32
  • 55
Mark
  • 6,052
  • 8
  • 61
  • 129
  • I'm sure this link will help you https://stackoverflow.com/questions/1894269/convert-string-representation-of-list-to-list – kederrac Jan 29 '20 at 22:44

3 Answers3

2

If you can control the CSV, you could separate the items with some other known character that isn't going to be in a city and isn't a comma. Say colon (:).

Then row one, for example, would look like this:

1,Canada,Toronto:Ottawa:Montreal

When it comes to processing the data, you'll have that whole element, and you can just do

cities.split(':')

If you want to go the other way (you have the cities in a Python list, and you want to create this string) you can use join()

':'.join(['Toronto', 'Ottawa', 'Montreal'])
kederrac
  • 16,819
  • 6
  • 32
  • 55
blueteeth
  • 3,330
  • 1
  • 13
  • 23
  • Separating with special characters is inventing a wheel. CSV standard has notion for double-quotes that handles delimiters inside. For TS there's no difference if it's `.split` or deserializing json – Slam Jan 29 '20 at 23:26
2

Using ast.literal_eval(...) will work, but it requires special syntax that other CSV-reading software won't recognize, and uses an eval statement which is a red flag.

Using eval can be dangerous, even though in this case you're using the safer literal_eval option which is more restrained than the raw eval function.

Usually what you'll see in CSV files that have many values in a single column is that they'll use a simple delimiter and quote the field.

For instance:

ID,Country,Cities
1,Canada,"Toronto;Ottawa;Montreal"

Then in python, or any other language, it becomes trivial to read without having to resort to eval:

import csv

with open("data.csv") as fobj:
    reader = csv.reader(fobj)
    field_names = next(reader)

    rows = []
    for row in reader:
        row[-1] = row[-1].split(";")
        rows.append(row)

Issues with ast.literal_eval

Even though the ast.literal_eval function is much safer than using a regular eval on user input, it still might be exploitable. The documentation for literal_eval has this warning:

Warning: It is possible to crash the Python interpreter with a sufficiently large/complex string due to stack depth limitations in Python’s AST compiler.

A demonstration of this can be found here:

>>> import ast
>>> ast.literal_eval("()" * 10 ** 6)
[1]    48513 segmentation fault  python

I'm definitely not an expert, but giving a user the ability to crash a program and potentially exploit some obscure memory vulnerability is bad, and in this use-case can be avoided.

If the reason you want to use literal_eval is to get proper typing, and you're positive that the input data is 100% trusted, then I suppose it's fine to use. But, you could always wrap the function to perform some sanity checks:

def sanely_eval(value: str, max_size: int = 100_000) -> object:
    if len(value) > max_size:
        raise ValueError(f"len(value) is greater than the max_size={max_size!r}")
    return ast.literal_eval(value)

But, depending on how you're creating and using the CSV files, this may make the data less portable, since it's a python-specific format.

damon
  • 14,485
  • 14
  • 56
  • 75
  • thanks for feedback. Did you probably mean `rows.append(row[-1].split(";"))` because otherwise it complains `AttributeError: 'list' object has no attribute 'split'` – Mark Jan 30 '20 at 22:03
  • @Mark Thank you for spotting that! I've updated my example. The code is slightly different from yours so that it preserves the whole row, and not just the _cities_ column. – damon Jan 31 '20 at 00:42
  • could you explain why `eval` statement is a big red flag? Is it considered dangerous? – Mark Feb 06 '20 at 15:05
  • @Mark I've added detail to my answer. I'm assuming you meant `literal_eval`, rather than `eval`. If you want info about why calling `eval("user input...")` is dangerous, look here: https://nedbatchelder.com/blog/201206/eval_really_is_dangerous.html – damon Feb 11 '20 at 19:54
0

For the specific structure of the csv, you could convert cities to list like this:

cities = '''"['Rome','Milan','Naples', 'Palermo']"'''

cities = cities[2:-2]  # remove "[ and ]"

print(cities)  # 'Rome','Milan','Naples', 'Palermo'

cities = cities.split(',')  # convert to list

print(cities)  # ["'Rome'", "'Milan'", "'Naples'", " 'Palermo'"]

cities = [x.strip() for x in cities]  # remove leading or following spaces (if exists)

print(cities)  # ["'Rome'", "'Milan'", "'Naples'", "'Palermo'"]

cities = [x[1:-1] for x in cities]  # remove quotes '' from each city

print(cities)  # ['Rome', 'Milan', 'Naples', 'Palermo']
Giannis Clipper
  • 707
  • 5
  • 9