-1

I have a large data file as shown below. Example Data File

Edited to include an updated example:

Updated Example

I wanted to add two new columns (E and F) next to column D and move the suite # when applicable and City/State data in cell D3 and D4 to E2 and F2, respectively. The challenge is not every entry has the suite number. I would need to insert a row first for those entries that don't have the suite number, only for them, not for those that already have the suite information.

I know how to do loops, but am having trouble to define the conditions. One way is to count the length of the string. How should I get started? Much appreciate your help!

Michael J.
  • 15
  • 4

1 Answers1

1

This is how I would do it. I don't recommend looping when using pandas. There are a lot of tools that it is often not needed. Some caution on this. Your spreadsheet has NaN and I think that is actually numpy np.nan equivalent. You also have blanks I am thinking that it is a "" equivalent.

import pandas as pd
import numpy as np

# dictionary of your data
companies = {
    'Comp ID': ['C1', '', np.nan, 'C2', '', np.nan, 'C3',np.nan],
    'Address': ['10 foo', 'Suite A','foo city', '11 spam','STE 100','spam town', '12 ham', 'Myhammy'],
    'phone': ['888-321-4567', '', np.nan, '888-321-4567', '', np.nan, '888-321-4567',np.nan],
    'Type': ['W_sale', '', np.nan, 'W_sale', '', np.nan, 'W_sale',np.nan],
}
# make the frames needed. 
df = pd.DataFrame( companies)
df1 = pd.DataFrame() # blank frame for suite and town columns

# Edit here to TEST the data types 
for r in range(0, 5):
    v = df['Comp ID'].values[r]
    print(f'this "{v}" is a ', type(v))

# So this will tell us the data types so we can construct our where(). Back to prior answer....

# Need a where clause it is similar to a if() statement in excel
df1['Suite'] = np.where( df['Comp ID']=='', df['Address'], np.nan)
df1['City/State'] = np.where( df['Comp ID'].isna(), df['Address'], np.nan)
# copy values to rows above
df1 = df1[['Suite','City/State']].backfill()
# joint the frames together on index
df = df.join(df1)
df.drop_duplicates(subset=['City/State'], keep='first', inplace=True)
# set the column order to what you want
df = df[['Comp ID', 'Type', 'Address', 'Suite', 'City/State', 'phone' ]]

output

Comp ID Type Address Suite City/State phone
C1 W_sale 10 foo Suite A foo city 888-321-4567
C2 W_sale 11 spam STE 100 spam town 888-321-4567
C3 W_sale 12 ham Myhammy 888-321-4567

Edit: the numpy where statement:

numpy is brought in by the line import numpy as np at the top. We are creating calculated column that is based on the 'Comp ID' column. The numpy does this without loops. Think of the where like an excel IF() function.

df1(return value) = np.where(df[test] > condition, true, false)

The pandas backfill Some times you have a value that is in a cell below and you want to duplicate it for the blank cell above it. So you backfill. df1 = df1[['Suite','City/State']].backfill().

Shane S
  • 1,747
  • 14
  • 31
  • I think this may work. Appreciate it! The file has over 10,000 rows. It would be impossible to create a dictionary manually. That is why I think I may need to create a loop. How should I create the dictionary more efficiently? – Michael J. Dec 08 '22 at 15:29
  • You don't need the dictionary. In your example you did not include a way for me to remake your data. So the dictionary is for me so I can provide an answer your example data. How are you loading the data into the pandas DataFrame? If you don't know then where is your data stored (.xlsx, .csv, html, SQL, parquet, etc.)? – Shane S Dec 08 '22 at 19:10
  • When you ask a question with data you should provide the data in a dictionary format, It allows people to answer quicker. – Shane S Dec 08 '22 at 19:12
  • I see. My data is a .xlsx file. – Michael J. Dec 08 '22 at 22:06
  • 1
    So you will need to follow the documentation methods for [read_excel()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html). Based on the conditions that you have it can be as simple as `df = pd.read_excel('c:/path/to/file.xlsx', sheet_name='sheet1')` – Shane S Dec 08 '22 at 22:33
  • 1
    Got you! City/State worked. But the Suite is all filled with NaN. Could you explain those three lines? I don't understand them. df1['Suite'] = np.where( df['Comp ID']=='', df['Address'], np.nan) df1['City/State'] = np.where( df['Comp ID'].isna(), df['Address'], np.nan) # copy values to rows above df1 = df1[['Suite','City/State']].backfill() Thanks! – Michael J. Dec 09 '22 at 21:14
  • 1
    Look at the image for your question then read my comments above my code. I don't know why you have blanks and NaN on different rows. In this line of code `df1['Suite'] = np.where( df['Comp ID']=='', df['Address'], np.nan)` it should appear blank but not actually be blank. I added a test above you can do prior `df1 = pd.DataFrame()`. – Shane S Dec 10 '22 at 06:24
  • I'll explain the numpy where above. – Shane S Dec 10 '22 at 06:29
  • I double-checked my df. I was mistaken. The rows with the suite information contain NaN as well, not blanks. I tried "df1['Suite'] = np.where( df['Comp ID'].isna(), df['Address'], np.nan) df1['City/State'] = np.where( df['Comp ID'].isna(), df['Address'], np.nan)" . It didn't work. Got duplicate City/States or Suites in both newly created Suite and City/State columns. I am thinking to use str.len() as a condition, since suites would be much shorter and City/States. how do I accomplish this? – Michael J. Dec 14 '22 at 20:54
  • The `.shift()` allows you to see the next cell and calculated based on its value. Documentation for [.shift()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shift.html) ... I would not state a value is a suite based on number of characters but it might work. I will add code that shows shift() inside the `np.where()`. – Shane S Dec 14 '22 at 22:47
  • There is a problem with `df['Comp ID'].shift(periods=-1).isna()`. It does not work on the last value. Why don't you look at other questions that have answered this one already. [example 1](https://stackoverflow.com/questions/48712081/splitting-a-single-line-address-into-its-constituent-parts-im-thinking-reg-ex), [example 2](https://stackoverflow.com/questions/11160192/how-to-parse-freeform-street-postal-address-out-of-text-and-into-components) – Shane S Dec 14 '22 at 23:21
  • [example 3](https://stackoverflow.com/questions/55105280/how-to-extract-apartment-from-address-in-pandas) – Shane S Dec 14 '22 at 23:23
  • I will study them! Thanks a bunch! Really appreciate your help! – Michael J. Dec 15 '22 at 19:31