There are already questions in this direction, but in my situation I have the following problem:

The column alias contains dictionaries. If I use the csv reader I get strings.

I have solved this problem with ast.literal_eval, but it is very slow and consumes a lot of resources.

The alternative, json.loads, does not work on this data because of the encoding.

Any ideas on how to solve this?

CSV File:

id;name;partei;term;wikidata;alias
2a24b32c-8f68-4a5c-bfb4-392262e15a78;Adolf Freiherr Spies von Büllesheim;CDU;10;Q361600;{}
9aaa1167-a566-4911-ac60-ab987b6dbd6a;Adolf Herkenrath;CDU;10;Q362100;{}
c371060d-ced3-4dc6-bf0e-48acd83f8d1d;Adolf Müller;CDU;10;Q363453;{'nl': ['Adolf Muller']}
41cf84b8-a02e-42f1-a70a-c0a613e6c8ad;Adolf Müller-Emmert;SPD;10;Q363451;{'de': ['Müller-Emmert'], 'nl': ['Adolf Muller-Emmert']}
15a7fe06-8007-4ff0-9250-dc7917711b54;Adolf Roth;CDU;10;Q363697;{}

Code:

with open(PATH_CSV+'mdb_file_2123.csv', "r", encoding="utf8") as csv8:
    csv_reader = csv.DictReader(csv8, delimiter=';')
    for row in csv_reader:

        if not (ast.literal_eval(row['alias'])):
            pass

        elif (ast.literal_eval(row['alias'])):
            known_as_list = list()

            for values in ast.literal_eval(row['alias']).values():
                for aliases in values:
                    known_as_list.append(aliases)

It works, but very slowly.

madik_atma
  • Just a drop in the ocean, but you can call `ast.literal_eval()` only once and remove the `pass` branch. So something like: `if (ast.literal_eval(row['alias'])): ....`. – toti08 Aug 06 '18 at 11:18
  • Similar to what @toti08 says, you can compute the `ast.literal_eval()` once, store it in a variable, and then refer to that variable. This'll save time and resources as you don't need to repeatedly compute the `ast.literal_eval()`. – Adi219 Aug 06 '18 at 11:27
  • It runs in O(n*m**2), where n is the number of rows and m is the length of the dict; that's why it's a bit slow. Try removing the last for loop and just call known_as_list.append(aliases), because dict values() gives the same as `for aliases in values:` in my code. – M. Ali Öztürk Aug 06 '18 at 11:29
  • How big is your file? Is using multiple processes an option? – Darkonaut Aug 06 '18 at 15:39
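
A minimal sketch of what the comments above suggest: parse each cell once, keep the result in a variable, and work with that variable afterwards. The helper name `collect_aliases` is hypothetical and not part of the original code.

```python
import ast


def collect_aliases(cell):
    """Return all aliases stored in one 'alias' cell (hypothetical helper)."""
    # Parse the string exactly once and reuse the resulting dict.
    parsed = ast.literal_eval(cell)
    # An empty dict simply yields an empty list, so no separate branch is needed.
    return [alias for names in parsed.values() for alias in names]


print(collect_aliases("{'de': ['Müller-Emmert'], 'nl': ['Adolf Muller-Emmert']}"))
# ['Müller-Emmert', 'Adolf Muller-Emmert']
print(collect_aliases("{}"))
# []
```

Compared with the code in the question, this does one `ast.literal_eval()` call per row instead of up to three.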

2 Answers

The ast library consumes a lot of memory (refer this link), so I would suggest avoiding it when converting a simple dictionary-formatted string into a Python dictionary. Instead, we can try Python's built-in eval function, which avoids the overhead of the imported module. As some discussions point out, eval is extremely dangerous on strings from untrusted sources. Example: eval('os.system("rm -rf /")'). But if we are sure that the CSV content will never carry such commands, we can use eval without worrying.

import csv

with open('input.csv', encoding='utf-8') as fd:
    csv_reader = csv.DictReader(fd, delimiter=';')

    for row in csv_reader:
        # Convert the dictionary in string format to a Python dict
        # (only safe if the CSV content is trusted)
        row['alias'] = eval(row['alias'])

        # Skip rows whose alias dictionary is empty
        if not row['alias']:
            continue

        # Flatten the lists of aliases of all languages into one list
        known_as_list = [aliases for values in row['alias'].values() for aliases in values]

        print(known_as_list)

Output

C:\Python34\python.exe c:\so\51712444\eval_demo.py
['Adolf Muller']
['Müller-Emmert', 'Adolf Muller-Emmert']
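
To make the safety trade-off above concrete, here is a small sketch of how ast.literal_eval behaves when a cell contains something other than a plain literal. The helper name `parse_alias` and the second input string are made up for illustration; they do not come from the question's data.

```python
import ast


def parse_alias(cell):
    """Parse a cell with ast.literal_eval; return None if it is not a plain literal."""
    try:
        return ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        # literal_eval refuses function calls, attribute access and the like,
        # so code hidden in the cell is never executed.
        return None


print(parse_alias("{'nl': ['Adolf Muller']}"))
# {'nl': ['Adolf Muller']}
print(parse_alias("__import__('os').getcwd()"))
# None
```

With eval, the second string would be executed as code. That is the risk taken in exchange for the speed, so it only pays off when the file comes from a trusted source.
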
Swadhikar

You can avoid calling literal_eval three times (one is sufficient). While I was at it, I cleaned up your code, or so I think, using a SO classic (3013 upvotes!) contribution:

from ast import literal_eval

# https://stackoverflow.com/a/952952/2749397 by Alex Martelli
flatten = lambda l: [item for sublist in l for item in sublist]
...

for row in csv_reader:
    known_as_list = flatten(literal_eval(row['alias']).values())

From the data excerpt shown by the OP, it seems possible to avoid calling literal_eval on a significant part of the rows, namely those whose alias is just '{}':

...
for row in csv_reader:
    if row['alias'] != '{}':
        known_as_list = flatten(literal_eval(row['alias']).values())
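
For completeness, a self-contained sketch of the same idea run against three rows copied from the question. Here io.StringIO stands in for the real file, and the function name `aliases_by_name` and the choice to key the result by the name column are my own assumptions for illustration.

```python
import csv
import io
from ast import literal_eval

# Three rows copied from the question; io.StringIO replaces the real file.
SAMPLE = """id;name;partei;term;wikidata;alias
2a24b32c-8f68-4a5c-bfb4-392262e15a78;Adolf Freiherr Spies von Büllesheim;CDU;10;Q361600;{}
c371060d-ced3-4dc6-bf0e-48acd83f8d1d;Adolf Müller;CDU;10;Q363453;{'nl': ['Adolf Muller']}
41cf84b8-a02e-42f1-a70a-c0a613e6c8ad;Adolf Müller-Emmert;SPD;10;Q363451;{'de': ['Müller-Emmert'], 'nl': ['Adolf Muller-Emmert']}
"""


def aliases_by_name(lines):
    """Map each name to its flat alias list, skipping rows with an empty dict."""
    result = {}
    for row in csv.DictReader(lines, delimiter=';'):
        cell = row['alias']
        if cell == '{}':
            # A cheap string comparison avoids the parse for empty dicts.
            continue
        parsed = literal_eval(cell)  # parsed once per remaining row
        result[row['name']] = [alias for names in parsed.values() for alias in names]
    return result


print(aliases_by_name(io.StringIO(SAMPLE)))
# {'Adolf Müller': ['Adolf Muller'], 'Adolf Müller-Emmert': ['Müller-Emmert', 'Adolf Muller-Emmert']}
```

How much this saves depends on how many rows really contain '{}'; in the excerpt it is three out of five.
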
gboffi