1

I have a piece of code I use to retrieve specific rows in a large usage data file for specific items which I write out as a list. For any matches in the DOI field the entire row is retrieved.

import pandas as pd
import numpy as np
from pandas import DataFrame

df_usage = pd.read_csv(r"usage_data.csv", encoding='latin-1')
article_usage = df.loc[df['DOI'] .isin(['DOI_1','DOI_2','DOI_3'])]

article_usage.to_csv(r"article_usage.csv")

However, I now need to retrieve a much larger amount of rows and so instead of writing out a list in my code with each item I want to change this to read a CSV containing all the items for which I need to retrieve rows in the data file.

df_usage = pd.read_csv(r"usage_data.csv", encoding='latin-1')
df_item_list = pd.read_csv(r"item_list.csv", encoding='latin-1', header=None)
article_usage = df_usage.loc[df_usage['DOI'] .isin(df_item_list)]

article_usage.to_csv(r"article_usage.csv")

The usage data file is a large CSV (100K+ rows and 53 columns with "DOI" being the column I am trying to match by).

           DOI      Jan    Feb   Mar
    0     DOI_1      27    123   345
    1     DOI_2      89    34    234
    2     DOI_3      65    56    102

The item list CSV is just a file containing each DOI I am trying to match.

        0
0     DOI_1
1     DOI_2
2     DOI_3

When I run the above the article_usage.csv file will only contain the data for the first item in my article list. How can i modify this code to retrieve data for all the items in the list? Any help would be greatly appreciated!

Lil_Joe
  • 35
  • 4
  • 1
    Your [mre] should always include a minimal example of the data. [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – wwii Mar 12 '21 at 14:54

1 Answers1

0

If you are trying to filter rows based on the item_list.csv you could try:

import pandas as pd
import numpy as np
from pandas import DataFrame

df = pd.read_csv(r"usage_data.csv", encoding='latin-1')
df_items = pd.read_csv(r"item_list.csv", encoding='latin-1', names=['DOI'])
article_usage = df.loc[df['DOI'] .isin(df_items['DOI'])]
article_usage.to_csv(r"article_usage.csv", index=False)

This assumes item_list.csv has no header and is just a list of required DOI entries.

Martin Evans
  • 45,791
  • 17
  • 81
  • 97