
Update: see Update 1 and Update 2 at the end.

I read the following in "get list from pandas dataframe column":

Pandas DataFrame columns are Pandas Series when you pull them out

However, this is not true in my case:

First part (building the DataFrame from scraped JSON). Because it contains business information I cannot show the full code, but it reads one row of data at a time, stores it in a Series, and appends it to the end of the DataFrame.

dfToWrite = pandas.DataFrame(columns=[lsHeader]) # Empty with column headers
for row in jsAdtoolJSON['rows']:
    lsRow = []
    for col in row['row']:
        lsRow.append((col['primary'])['value'])
    dfRow = pandas.Series(lsRow, index = dfToWrite.columns)
    dfToWrite = dfToWrite.append(dfRow, ignore_index = True)

Next part (checking the type; please ignore what the function is meant to do):

def CalcMA(df: pandas.DataFrame, target: str, period: int, maname: str):
    print(type(df[target]))

Finally call the function: ("Raw_Impressions" is a column header)

CalcMA(dfToWrite, "Raw_Impressions", 5, "ImpMA5")

Python console shows:

<class 'pandas.core.frame.DataFrame'>

Additional question: how do I get a list from a DataFrame column if it is not a Series (in which case I could use tolist())?

Update 1: From "Bokeh: AttributeError: 'DataFrame' object has no attribute 'tolist'":

I figured out that I need to use .values.tolist(). However, that still doesn't explain why I get another DataFrame, not a Series, when I pull out a column.
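For reference, a minimal sketch of what .values.tolist() returns when the selection is a DataFrame rather than a Series. The frame below is hypothetical (duplicate column names are just one easy way to make a selection come back as a DataFrame); the point is that the result is a nested list with one inner list per row, so a single column still has to be picked by position to get a flat list.

```python
import pandas as pd

# Hypothetical frame: two columns share the name "a", so df["a"] is a DataFrame.
df = pd.DataFrame([[1, 4, 7], [2, 5, 8]], columns=list("aab"))

selected = df["a"]                      # DataFrame with two columns named "a"
nested = selected.values.tolist()       # one inner list per row
flat = selected.iloc[:, 0].tolist()     # first matching column as a flat list

print(nested)  # [[1, 4], [2, 5]]
print(flat)    # [1, 2]
```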

Update 2: I found out that df has a MultiIndex for its columns, which surprised me:

MultiIndex(levels=[['COST_/CPM', 'CTR', 'ECPM/_ROI', 'Goal_Ratio', 'Hour_of_the_Day', 'IMP./Joins', 'Raw_Clicks_/_Unique_Clicks', 'Raw_Impressions', 'Unique_Goal_/_UniqueGoal_Forecasted_Value']], labels=[[4, 7, 5, 6, 1, 8, 3, 0, 2]])

I don't see the labels when printing the df or writing it to .csv; it looks like a normal DataFrame. I'm not sure where the labels came from.

Nicholas Humphrey

2 Answers


I think you have duplicated column names, so selecting one of them returns a DataFrame instead of a Series:

import pandas as pd

df = pd.DataFrame([[1,2],[4,5], [7,8]], index=list('aab')).T
print (df)
   a  a  b
0  1  4  7
1  2  5  8

print (df['a'])
   a  a
0  1  4
1  2  5

print (type(df['a']))
<class 'pandas.core.frame.DataFrame'>

print (df['b'])
0    7
1    8
Name: b, dtype: int64

print (type(df['b']))
<class 'pandas.core.series.Series'>
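To check whether duplicated names are the cause, something like the following sketch may help. It reuses the same toy frame as above; keeping only the first occurrence of each name is an assumption about what you want, not the only option.

```python
import pandas as pd

# Same toy data as above: the name "a" appears twice.
df = pd.DataFrame([[1, 4, 7], [2, 5, 8]], columns=list("aab"))

print(df.columns.is_unique)  # False

# Names that occur more than once (duplicated() marks the repeats).
dupes = df.columns[df.columns.duplicated()].tolist()
print(dupes)  # ['a']

# Keep only the first occurrence of each name; now df["a"] is a Series.
deduped = df.loc[:, ~df.columns.duplicated()]
print(type(deduped["a"]))
```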

EDIT:

Here the problem is different: the columns are a one-level MultiIndex. The solution is to reassign the first level back to the columns with get_level_values:

# Note: the `codes` argument was called `labels` before pandas 0.24
mux = pd.MultiIndex(levels=[['COST_/CPM', 'CTR', 'ECPM/_ROI', 'Goal_Ratio', 'Hour_of_the_Day',
                             'IMP./Joins', 'Raw_Clicks_/_Unique_Clicks', 'Raw_Impressions',
                             'Unique_Goal_/_UniqueGoal_Forecasted_Value']],
                    codes=[[4, 7, 5, 6, 1, 8, 3, 0, 2]])

df = pd.DataFrame([range(9)], columns=mux)
print (type(df['CTR']))
<class 'pandas.core.frame.DataFrame'>

df.columns = df.columns.get_level_values(0)
print (type(df['CTR']))
<class 'pandas.core.series.Series'>
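As for where the MultiIndex may come from: one possible source is the `columns=[lsHeader]` argument in the question. This is a sketch under the assumption that `lsHeader` is already a list of strings (the question does not show it); the header values below are hypothetical. Wrapping a list in another list hands pandas a list of arrays, which it turns into a MultiIndex with a single level.

```python
import pandas as pd

# Hypothetical header list; the real lsHeader is not shown in the question.
ls_header = ["Hour_of_the_Day", "Raw_Impressions", "CTR"]

wrapped = pd.DataFrame(columns=[ls_header])  # list inside a list
plain = pd.DataFrame(columns=ls_header)      # the list itself

print(type(wrapped.columns).__name__)  # MultiIndex
print(type(plain.columns).__name__)    # Index
```

If that assumption holds, passing `lsHeader` directly would avoid the problem at the source, and `get_level_values(0)` would no longer be needed.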
jezrael
  • Thanks @jezrael, I just printed out the columns and found out that I actually have a MultiIndex. Does it cause the issue? (The `levels` do not have duplicates.) It's very strange, because when I `print(df)` it doesn't show any of the labels. I'll update the question with the labels. – Nicholas Humphrey Jan 08 '19 at 06:30
  • @NicholasHumphrey - yes, if it is a MultiIndex, it behaves like a duplicated first level :) – jezrael Jan 08 '19 at 06:31
  • Thanks, I finally found the source of the problem. BTW, I should not have a MultiIndex, so I'll figure out where it comes from. – Nicholas Humphrey Jan 08 '19 at 06:33
  • @NicholasHumphrey - Added a solution for your situation. – jezrael Jan 08 '19 at 06:39
  • Thanks! I'll take a look in the morning, but I think it will work. The next step is to track down the source of the MultiIndex... – Nicholas Humphrey Jan 08 '19 at 06:48
  • @NicholasHumphrey - yes, this kind of error is very unpleasant, especially because it is not visible when you print the DataFrame. – jezrael Jan 08 '19 at 06:49

Each instance of pandas.core.frame.DataFrame is basically a two-dimensional array. If you are getting this type, you can still reach each column (which will be of type pandas.core.series.Series when it is one-dimensional) by going through df.columns.

df.columns gives you an iterable of column labels; loop over it and select each column by its label to get the values along the rows.
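A minimal sketch of that loop, using a small hypothetical frame with unique column names (so each selection is a Series and tolist() is available):

```python
import pandas as pd

# Hypothetical data with unique column names.
df = pd.DataFrame({"Raw_Impressions": [100, 120], "CTR": [0.01, 0.02]})

# df.columns holds the labels; df[name] selects the column itself.
for name in df.columns:
    column = df[name]
    print(name, type(column).__name__, column.tolist())

# The same idea collected into a dict of plain Python lists.
column_lists = {name: df[name].tolist() for name in df.columns}
```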

You might also want to look at pandas.read_json or a similar package to load the JSON directly into a pandas object, which might be easier to manage.
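Along the same lines, here is a sketch that builds the whole DataFrame in one step instead of appending row by row. It does not use pandas.read_json, because the nested layout needs unpacking first; it uses a plain list comprehension instead. The JSON layout (`rows` -> `row` -> `primary` -> `value`) is copied from the loop in the question, while the header names and values are made up for illustration. Note also that DataFrame.append was removed in pandas 2.0, so the row-by-row version no longer runs on current versions.

```python
import pandas as pd

# Hypothetical header and payload, shaped like the loop in the question expects.
ls_header = ["Hour_of_the_Day", "Raw_Impressions"]
js = {
    "rows": [
        {"row": [{"primary": {"value": 0}}, {"primary": {"value": 100}}]},
        {"row": [{"primary": {"value": 1}}, {"primary": {"value": 120}}]},
    ]
}

# One inner list per row, one value per column.
records = [[col["primary"]["value"] for col in row["row"]] for row in js["rows"]]

# Pass the header list itself, not [ls_header], so the columns stay a flat Index.
df = pd.DataFrame(records, columns=ls_header)

print(df["Raw_Impressions"].tolist())  # [100, 120]
```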

anky
NiallJG
  • Thanks @NiallJG, I managed to use `df[target].values.tolist()` to get a list from a column. But it still confuses me why `df[target]`, in which `target` is just a string, is not a `Series`. – Nicholas Humphrey Jan 08 '19 at 06:26
  • @jezrael's answer suggests that there may be duplicate columns. Try running print(df.columns) to see what the column headings are named; maybe two of them are the same string. – NiallJG Jan 08 '19 at 06:29
  • Thanks, yeah, I found out there is a MultiIndex. I'll check the earlier code to find the source of it. – Nicholas Humphrey Jan 08 '19 at 06:34