
Let's consider the following example code,

pre_process.py

import pandas as pd
from sklearn.preprocessing import LabelBinarizer

class PreProcess(object):

    def __init__(self):
        ... .... ....
        ... .... ....

    def fit_clms(self, lb_style, dataset, style_clms = ['A', 'B']):  # C:
        # B: expected "dataset['X', 'Y']", but this evaluates as "dataset[['X', 'Y']]" (note the nested list)
        lb_results = lb_style.fit_transform(dataset[style_clms])
        # (**Worked - by this line**) lb_results = lb_style.fit_transform(dataset['A', 'B', 'C'])
        print(lb_results)

        if lb_style.classes_.shape[0] > 0:
            ... .... ....
            ... .... ....

    def process_chunks(self, chunks):
        lb_style = LabelBinarizer()
        print('------------------------------------------------\n')
        count = 0
        for dataset in chunks:
            count += 1
            print ('Processing the Chunk %d ...' % count)

            # Group By
            dataset['Grouping_MS'] = dataset[['_time', 'source']].apply(self.group_by_clm, axis=1)
            dataset = self.fit_clms(lb_style, dataset, ['X', 'Y'])  # A:
                ... .... ....
                ... .... ....            

    def init(self):
        Times.start()
        # Read the Source File
        chunks = self.read_csv_file(SOURCE_FILE, CHUNK_SIZE)    
        self.process_chunks(chunks)
            ... .... ....
            ... .... ....            

Here, how can I pass a list such as ['X', 'Y'] at (A:) and use it in "dataset[style_clms]" at (B:)? Right now it evaluates as dataset[['X', 'Y']] (a nested list), but I want dataset['X', 'Y'].

Also, is it good practice to use a list as a default parameter value (C:) in a function definition? If not, what are the alternatives? I ask because Pylint gives a warning like "Dangerous default value [] as argument".

Any ideas? Thanks,

Jai K
    To the last point, no [it's usually not a good idea](https://stackoverflow.com/questions/1132941/least-astonishment-and-the-mutable-default-argument). The rest I am struggling to understand. The letters on the left hand side are pretty distracting, it would be better to use comments at the end of each line or immediately above them. – roganjosh Oct 18 '18 at 12:29
    usually you shouldn't use lists as default arguments because then every call to the function will have the same list object and if you change it (by appending something or removing something from it) it will change the list for the next time you call the function – AntiMatterDynamite Oct 18 '18 at 12:33
  • @roganjosh Thanks, I edited now. Pl have a look. – Jai K Oct 18 '18 at 12:38
  • @AntiMatterDynamite Thanks for the explanation, could you pl suggest any other work around for this situation? – Jai K Oct 18 '18 at 12:43
    @JaiK yes you can use the default parameter of `None` then check in the function if it's `None` to give it the default list value you want, that way it will create a new list every time the function is called – AntiMatterDynamite Oct 18 '18 at 12:44
  • @AntiMatterDynamite Or, you can make it more explicit by using a `tuple` as default value and convert to a `list` if that is needed by the subsequent code. – norok2 Oct 18 '18 at 14:54
  • @norok2 well that's for this simple case of 2 variables but i would assume that the real data serialized is more complex and you can either parse it all in the C++ code and build a type to represent it all and then send it to python, or you can just do it on the python level - which is easier – AntiMatterDynamite Oct 18 '18 at 17:00

2 Answers


That []-default-value thing catches a lot of people out, so I'll cover it first. When Python's running your code, it does this:

def append_two(a=[]):
    a.append(2)
    return a

print(append_two())
print(append_two([1, 2, 3]))
print(append_two())

Oh, look! A function definition! Ok, so the default value is []; let's evaluate that... And some code, but let's not run that yet.

def append_two(a=<list object at 0x34FE2910>):
    ...

print(append_two())
print(append_two([1, 2, 3]))
print(append_two())

Ok, now let's run it. Appending 2 to [] makes [2], so we print("[2]"). Appending 2 to [1, 2, 3] makes [1, 2, 3, 2], so we print("[1, 2, 3, 2]"). Appending 2 to [2] makes [2, 2], so we print("[2, 2]"). And done!

[2]
[1, 2, 3, 2]
[2, 2]

Why does this happen? Well, it was that first stage. Python, when evaluating the function definition, created The Default List for append_two. And that means that, if you don't pass in a list, it'll always append to that one. That list will slowly grow over time, as more 2s keep getting added to it.
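One way to watch this happen is to look at the function's `__defaults__` attribute, which is where the already-evaluated default values are kept. A minimal sketch, reusing the buggy `append_two` from above:

```python
def append_two(a=[]):
    a.append(2)
    return a

# Default values are evaluated once, at definition time, and stored on the function.
print(append_two.__defaults__)  # ([],)

append_two()
append_two()

# Both calls mutated the same stored list object.
print(append_two.__defaults__)  # ([2, 2],)
```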

The reason for this is consistency. When you run the function, only the stuff inside the function will run. Nowhere inside the function does it say "make a new list", so it doesn't. If you want it to, you have to tell it to, like so:

def append_two(a=None):
    if a is None:
        a = []  # Make a new list
    a.append(2)
    return a

This is clunky and annoying, but that's the price you have to pay for consistency. The alternatives are worse.
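As suggested in the comments on the question, an immutable default such as a tuple is another option when the function only reads the argument. A rough sketch, where `pick_columns` and `row` are hypothetical stand-ins (not part of the original code) for `fit_clms` and `dataset`:

```python
def pick_columns(row, style_clms=("A", "B")):
    # A tuple cannot be mutated in place, so sharing it between calls is harmless.
    columns = list(style_clms)  # fresh list on every call, if a list is needed
    return [row[name] for name in columns]

row = {"A": 1, "B": 2, "X": 3, "Y": 4}  # made-up data for illustration
print(pick_columns(row))              # [1, 2]
print(pick_columns(row, ["X", "Y"]))  # [3, 4]
```

Pylint's warning is about mutable defaults specifically, so a tuple default should not trigger it.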


Now onto your main problem. I'll simplify it slightly.

class DemoClass:
    def __getitem__(self, index):
        return index
dataset = DemoClass()

style_clms = ["X", "Y"]
print(dataset[style_clms])

This prints ['X', 'Y']. Let's see what dataset["X", "Y"] prints:

>>> print(dataset["X", "Y"])
('X', 'Y')

Ok... This is called a tuple. It's easy enough to convert a list into a tuple:

>>> print(dataset[tuple(style_clms)])
('X', 'Y')

Hooray! We've successfully replicated dataset["X", "Y"] for arbitrary things! :-) This hopefully solves your problem.
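The reason the conversion works is that Python packs comma-separated subscripts into a single tuple before calling `__getitem__`. A small check of that, reusing the same `DemoClass` stand-in (not your real `dataset`):

```python
class DemoClass:
    def __getitem__(self, index):
        return index

dataset = DemoClass()
style_clms = ["X", "Y"]

# Comma-separated subscripts arrive as one tuple argument.
assert dataset["X", "Y"] == dataset[("X", "Y")]

# A list stays a list, so it is a different key from the tuple.
assert dataset[style_clms] != dataset["X", "Y"]

# Converting first reproduces the comma form exactly.
assert dataset[tuple(style_clms)] == dataset["X", "Y"]

print(type(dataset[style_clms]).__name__, type(dataset["X", "Y"]).__name__)  # list tuple
```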

wizzwizz4
  • Thanks for introducing "__getitem__()", but could you pl explain "How do I get access that 'thing' list inside of another function in the same "DemoClass", except the following ways, `code` class DemoClass: def __getitem__(self, index): #return index self.thing = index # Except, by using this line def fn(self): print(self.thing) # I want to access "[X, Y]" in here obj = DemoClass() thing = ["X", "Y"] obj[thing] obj.fn() # prints "['X', 'Y']" `code` – Jai K Oct 18 '18 at 13:17
  • @JaiK Oh... `DemoClass` was just so we could create an object behaving similarly to `dataset`; don't use `DemoClass` in your actual code. Just use the `dataset` you already had. – wizzwizz4 Oct 18 '18 at 14:22

Just flatten the list with this:

import itertools
flat_list = list(itertools.chain(*list2d))

or

flat_list = [item for sublist in list2d for item in sublist]
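
A quick check that both forms agree, assuming `list2d` is a list with one level of nesting such as `[['X', 'Y']]` (the sample value is made up here). `itertools.chain.from_iterable` is used as the equivalent of `chain(*list2d)` that avoids unpacking the arguments:

```python
import itertools

list2d = [["X", "Y"]]  # assumed shape: one level of nesting

flat_chain = list(itertools.chain.from_iterable(list2d))
flat_comp = [item for sublist in list2d for item in sublist]

print(flat_chain)  # ['X', 'Y']
print(flat_comp)   # ['X', 'Y']
```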
Novak