1

I apologize for the neophyte question, but I'm having a hard time figuring out Pandas' data frames. I have one data frame with something like

df_index:
Product    Title
100000     Sample main product
200000     Non-consecutive main sample

I have another data frame with a more detailed list of the products with formats, like

df_details:
Product                    Title
100000                    Sample main product
100000-Format-English     Sample product details
100000-Format-Spanish     Sample product details
100000-Format-French      Sample product details
110000                    Another sample main product
110000-Format-English     Another sample details
110000-Format-Spanish     Another sample details
120000                    Yet another sample main product
120000-Format-English     Yet another sample details
120000-Format-Spanish     Yet another sample details
...
200000                    Non-consecutive main sample
200000-Format-English     Non-consecutive sample details
200000-Format-Spanish     Non-consecutive sample details

I want to create a new data frame based on df_details, but only for the products that appear in df_index. Ideally, it would look something like:

new_df:
Product                    Title
100000                    Sample main product
100000-Format-English     Sample product details
100000-Format-Spanish     Sample product details
100000-Format-French      Sample product details
200000                    Non-consecutive main sample
200000-Format-English     Non-consecutive sample details
200000-Format-Spanish     Non-consecutive sample details

Here's what I've tried so far:

new_df = df_details[df_details['Product'][0:5] == df_index['Product'][0:5]]

That gives me a error:

ValueError: Can only compare identically-labeled Series objects

I've also tried

new_df = pd.merge(df_index, df_details, 
  left_on=['Product'[0:5]], right_index=True, how='left')

Which does give me a resulting data set, but not the kind I want—it doesn't include the details rows with the format information.

Fabio Lamanna
  • 20,504
  • 24
  • 90
  • 122
nathan.hunt
  • 117
  • 2
  • 8

2 Answers2

2

You should be able to use .isin() as:

new_df = df_details[df_details['Product'].isin(df_index['Product']]

This will perform a mask looking up only the common indices.

EDIT: this works only whether the column has the same string. To solve this you can use str.contains() with:

import re

# create a pattern to look for
pat ='|'.join(map(re.escape, df_index['Product']))

# Create the mask
new_df = df_details[df_details['Product'].str.contains(pat)]

This works if the column is formatted as string.

Fabio Lamanna
  • 20,504
  • 24
  • 90
  • 122
  • Nice. was just writing the same solution. – Ted Petrou Dec 22 '16 at 18:13
  • That does work, sort of. It doesn't give me the 200000-Format-English style rows, though. Possibly because it's not an exact match for the rows that look like 200000? – nathan.hunt Dec 22 '16 at 18:25
  • 1
    @nathan.hunt yes you're right, I thought all rows have the same format...this performs a lookup on 'same-format' rows... thinking on a more general solution.. – Fabio Lamanna Dec 22 '16 at 18:40
  • 1
    This should help http://stackoverflow.com/questions/27975069/how-to-filter-rows-containing-a-string-pattern-from-a-pandas-dataframe – Fabio Lamanna Dec 22 '16 at 18:54
  • @FabioLamanna Thanks so much for your help! I did use `re.escape`, and it worked, but for some reason I can't figure out, it also gave me some rows I wasn't expecting… and which weren't found in df_index, as near as I can tell. I did figure out how to solve the problem with `for` and `if` loops with pandas' `.iterrow()`, though, so I'll post that. – nathan.hunt Dec 23 '16 at 17:32
0

Here's how I managed to pull this off—I'm sure it's not pretty, or the fastest way to accomplish it, but it does work.

I used pandas' .itterow() with some for and if loops to go through the data frame row-by-row:

# create a list based on the 'Product' column of df_index
increment = 0
index_list = []
for product, row in df_index.iterrows():
    prod_num = df_index.product.iloc[increment]
    index_list.append(prod_num)
    increment += 1

#construct a new data frame based on the rows in df_details that are found in index_list
new_df = pd.DataFrame(columns=detail_df.columns)
increment_detail = 0
for product, row in df_details.iterrows():
    prod_num_detail = df_details.product.iloc[increment_detail]
    prod_num_detail = prod_num_detail[0:6]
    if str(prod_num_detail) in dupe_list:
        new_df = new_df.append(df_details.iloc[increment_detail])
        increment_detail += 1
    else:
        increment_detail += 1
nathan.hunt
  • 117
  • 2
  • 8