0

I get a list from reading the columns of an excel file, I need to validate if a column is in the list, if not then create it empty, but the conditional is not working, I know for a fact that 'City' is missing so I'm expecting it to be created and is not.

files = os.listdir(input_path)
files_xlsx = [f for f in files if f[-5:] == ".xlsx"]

df = pd.DataFrame()

for f in files_xlsx:
    wb = load_workbook(input_path + "\\" + f, read_only=True)
    if 'New Leads' in wb.sheetnames:
        df = pd.read_excel(input_path + "\\" + f, sheet_name="New Leads")
        colre = [col for col in df.columns if "Email" in col]
        for eo in colre:
            df.rename(columns={eo: eo.replace(' ','').replace('*','').replace('**','') for eo in colre}, inplace=True)
        dtcol = [col for col in df.columns if "FIRST NAME" in col.upper()]
        for ex in dtcol:
            df.rename(columns={ex: "First Name"}, inplace=True)
        dtcol = [col for col in df.columns if "LAST NAME" in col.upper()]
        for ex in dtcol:
            df.rename(columns={ex: "Last Name"}, inplace=True)

NOT WORKING PIECE OF CODE

        dtcol = [col for col in df.columns if "CITY" in col.upper()]
        for ex in dtcol:
            if len(dtcol)==0:
                df['City'] = NaN
            else:
                df.rename(columns={ex: "City"}, inplace=True)

END OF NOT WORKING PIECE OF CODE

        dtcol = [col for col in df.columns if "COMPANY NAME" in col.upper()]
        for ex in dtcol:
            df.rename(columns={ex: "*** Company"}, inplace=True)
        if "SOURCE" in cols:
            df['Email'].replace('', np.nan, inplace=True)
            df.dropna(subset=['Email'], inplace=True)
            if df.dtypes['SOURCE'] != 'float':
                df.dropna(how = 'all')
                df['File'] = f
                if df.SOURCE.map(lambda x: len(x)).max() <= 10:
                    df = pd.merge(df, df_ac[["SOURCE", "Lead Source", "Lead Category"]], on="SOURCE", how='left')
                    del df['SOURCE']
                    df.rename(columns={"Lead Source": "SOURCE"}, inplace=True)
                else:
                    df.rename(columns={"SOURCE": "Lead Source"}, inplace=True)
                    df = pd.merge(df, df_ac[["Lead Source", "Lead Category"]], on="Lead Source", how='left')
                    df.rename(columns={"Lead Source": "SOURCE"}, inplace=True)
                df_xlsx = pd.concat([df, df_xlsx], ignore_index=True)
            else:
                df_ns = df_ns.append([f], ignore_index=True)
        else:
            df_ns = df_ns.append([f], ignore_index=True)
    else:
        df_ns = df_ns.append([f], ignore_index=True)
Berny
  • 113
  • 11
  • Please provide the expected [MRE - Minimal, Reproducible Example](https://stackoverflow.com/help/minimal-reproducible-example). Show where the intermediate results deviate from the ones you expect. We should be able to paste a single block of your code into file, run it, and reproduce your problem. This also lets us test any suggestions in your context. Your posted code does not run. [Include your minimal data frame](https://stackoverflow.com/questions/52413246/how-to-provide-a-reproducible-copy-of-your-dataframe-with-to-clipboard) as part of the example. – Prune Feb 03 '21 at 01:41
  • You haven't fully addressed Prune's comment. We're looking for you to explain what you expect your code to do, what it actually does, and how what it actually does deviates from what you expect. – Acccumulation Feb 03 '21 at 01:56
  • So in this case, I'm expecting to create an empty column named: "City" since it wasn't found in the excel file, the if len(dtcol) ==0: is not working since i have placed some prints there, and it's not working. – Berny Feb 03 '21 at 02:01
  • Not related but `glob.glob('*.xlsx')` is, IMHO, a more pythonic way to get your list of Excel files. – accdias Feb 03 '21 at 02:02

2 Answers2

2

When you ask Python to iterate over an empty list, it does nothing, because there is nothing to iterate over. So putting if len(dtol)==0 inside the for-loop does nothing; if the list is empty, it never gets evaluated, and if the list isn't empty, it gets evaluated to false (and it's wasteful, since it has to be evaluated on each iteration of the loop, even though it's the same each time). You need to put it outside.

    dtcol = [col for col in df.columns if "CITY" in col.upper()]
    if dtcol:
        for ex in dtcol:
            df.rename(columns={ex: "City"}, inplace=True)
    else:
        df['City'] = NaN
Acccumulation
  • 3,491
  • 1
  • 8
  • 12
  • Thanks, it did worked but using df['City'] = "", using NaN was giving me this error: NameError: name 'NaN' is not defined – Berny Feb 03 '21 at 02:11
1

Consider the following pseudo-code:

1. H = a new house, painted red 
2. clr = the color of house `H`
3. paint `H` blue
4. print(clr)

What will be printed? red or blue?
The answer is: red

The variable clr does not change when house H gets re-painted.

Variable in python are usually COPIES of the original data.

Step 1. Have Sarah put on a brown sweater
Step 2. Let `x` be the color of Sarah's sweater. x = "brown".
Step 3. Tell Sarah to take off her brown sweater
Step 4. Tell Sarah to put on a green sweater
Step 5. What is x? Answer: x == "brown"

Python variables do not update automatically.

Most variables are a snap-shot in time.

1. Let `x` be the temperature of the food in the oven (376 degrees Fahrenheit)
2. wait 2 hours.
3. observe that the food is now cold (73 degrees Fahrenheit)
4. What is x? `x` is 376 degrees Fahrenheit

Below is your original code:

dtcol = [col for col in df.columns if "CITY" in col.upper()]
    for ex in dtcol:
        if len(dtcol)==0:
            df['City'] = NaN
        else:
            df.rename(columns={ex: "City"}, inplace=True)

Your code is has the same overall structure as the following:

dtcol = [1, 2, 3]
    for ex in dtcol:
        if len(dtcol)==0:
            print("hello world")

The length of dtcol will never change.

Also, there is no reason to check if len(dtcol)==0 inside of teh for-loop, because if you are inside of the for-loop it is guaranteed that the length of dtcol is at least one.

If the length of dtcol was zero, the for-loop would have never executed.

Note that in your code, len(dtcol) never changes inside of the for-loop.

Why do you check the length on each and every iteration of the for-loop?

There is no reason to put if len(dtcol)==0 inside of the loop.

Why did you write df['City'] = Nan?

NaN is not a valid python keyword. Did you mean float("NaN")?

I think you do not know what a for-loop is. Otherwise, you would understand that the following code will print nothing:


empty_list = []
for x in empty_list:
    print("A")

nonempty_list = [1, 2, 3]
for y in nonempty_list:
   if len(nonempty_list) == 0:
       print("B")

Your code makes as much sense as the following:

for each child `C` in the classroom:
   if there are no children in the classroom:
       say the name of child `C` out-loud

for each cookie in the cookie jar:
    if the cookie jar is empty:
        eat the cookie

for each day you work next week:
   if you do not work at all next week:
       go swimming
Toothpick Anemone
  • 4,290
  • 2
  • 20
  • 42