
I'm trying to filter a CSV and write a new CSV that is exactly the same except that it drops any rows with a value greater than 100 billion in the 'marketcap' column.

The code I've written so far just spits out the same CSV as the original and doesn't cut any lines from the old CSV when writing the new one.

Code:

import pandas as pd

db = pd.read_csv('SF1_original.csv')
# keep only the rows at or below 100 billion
db = db[db['marketcap'] <= 100000000000]
db.to_csv('new_SF1_original.csv')
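A quick check along these lines shows whether pandas even parsed the column as numbers (a minimal diagnostic sketch, using the same file and column name as above):

import pandas as pd

db = pd.read_csv('SF1_original.csv')

# If this prints 'object' rather than a numeric dtype such as 'float64',
# the column was read as text and the numeric comparison won't filter as intended
print(db['marketcap'].dtype)

# Count of missing values in the column
print(db['marketcap'].isna().sum())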

Example of the old CSV (it's long, no need to read the whole thing; it's just to give you an idea):

ticker,dimension,calendardate,datekey,reportperiod,lastupdated,accoci,assets,assetsavg,assetsc,assetsnc,assetturnover,bvps,capex,cashneq,cashnequsd,cor,consolinc,currentratio,de,debt,debtc,debtnc,debtusd,deferredrev,depamor,deposits,divyield,dps,ebit,ebitda,ebitdamargin,ebitdausd,ebitusd,ebt,eps,epsdil,epsusd,equity,equityavg,equityusd,ev,evebit,evebitda,fcf,fcfps,fxusd,gp,grossmargin,intangibles,intexp,invcap,invcapavg,inventory,investments,investmentsc,investmentsnc,liabilities,liabilitiesc,liabilitiesnc,marketcap,ncf,ncfbus,ncfcommon,ncfdebt,ncfdiv,ncff,ncfi,ncfinv,ncfo,ncfx,netinc,netinccmn,netinccmnusd,netincdis,netincnci,netmargin,opex,opinc,payables,payoutratio,pb,pe,pe1,ppnenet,prefdivis,price,ps,ps1,receivables,retearn,revenue,revenueusd,rnd,roa,roe,roic,ros,sbcomp,sgna,sharefactor,sharesbas,shareswa,shareswadil,sps,tangibles,taxassets,taxexp,taxliabilities,tbvps,workingcapital
A,ARQ,1999-12-31,2000-03-15,2000-01-31,2020-09-01,53000000,7107000000,,4982000000,2125000000,,10.219,-30000000,1368000000,1368000000,1160000000,131000000,2.41,0.584,665000000,111000000,554000000,665000000,281000000,96000000,0,0.0,0.0,202000000,298000000,0.133,298000000,202000000,202000000,0.3,0.3,0.3,4486000000,,4486000000,50960600000,,,354000000,0.806,1.0,1086000000,0.484,0,0,4337000000,,1567000000,42000000,42000000,0,2621000000,2067000000,554000000,51663600000,1368000000,-160000000,2068000000,111000000,0,1192000000,-208000000,-42000000,384000000,0,131000000,131000000,131000000,0,0,0.058,915000000,171000000,635000000,0.0,11.517,,,1408000000,0,114.3,,,1445000000,131000000,2246000000,2246000000,290000000,,,,,0,625000000,1.0,452000000,439000000,440000000,5.116,7107000000,0,71000000,113000000,16.189,2915000000

Example of the new CSV (exactly the same, even though this line should have been cut):

,ticker,dimension,calendardate,datekey,reportperiod,lastupdated,accoci,assets,assetsavg,assetsc,assetsnc,assetturnover,bvps,capex,cashneq,cashnequsd,cor,consolinc,currentratio,de,debt,debtc,debtnc,debtusd,deferredrev,depamor,deposits,divyield,dps,ebit,ebitda,ebitdamargin,ebitdausd,ebitusd,ebt,eps,epsdil,epsusd,equity,equityavg,equityusd,ev,evebit,evebitda,fcf,fcfps,fxusd,gp,grossmargin,intangibles,intexp,invcap,invcapavg,inventory,investments,investmentsc,investmentsnc,liabilities,liabilitiesc,liabilitiesnc,marketcap,ncf,ncfbus,ncfcommon,ncfdebt,ncfdiv,ncff,ncfi,ncfinv,ncfo,ncfx,netinc,netinccmn,netinccmnusd,netincdis,netincnci,netmargin,opex,opinc,payables,payoutratio,pb,pe,pe1,ppnenet,prefdivis,price,ps,ps1,receivables,retearn,revenue,revenueusd,rnd,roa,roe,roic,ros,sbcomp,sgna,sharefactor,sharesbas,shareswa,shareswadil,sps,tangibles,taxassets,taxexp,taxliabilities,tbvps,workingcapital
0,A,ARQ,1999-12-31,2000-03-15,2000-01-31,2020-09-01,53000000.0,7107000000.0,,4982000000.0,2125000000.0,,10.219,-30000000.0,1368000000.0,1368000000.0,1160000000.0,131000000.0,2.41,0.584,665000000.0,111000000.0,554000000.0,665000000.0,281000000.0,96000000.0,0.0,0.0,0.0,202000000.0,298000000.0,0.133,298000000.0,202000000.0,202000000.0,0.3,0.3,0.3,4486000000.0,,4486000000.0,50960600000.0,,,354000000.0,0.8059999999999999,1.0,1086000000.0,0.484,0.0,0.0,4337000000.0,,1567000000.0,42000000.0,42000000.0,0.0,2621000000.0,2067000000.0,554000000.0,51663600000.0,1368000000.0,-160000000.0,2068000000.0,111000000.0,0.0,1192000000.0,-208000000.0,-42000000.0,384000000.0,0.0,131000000.0,131000000.0,131000000.0,0.0,0.0,0.057999999999999996,915000000.0,171000000.0,635000000.0,0.0,11.517000000000001,,,1408000000.0,0.0,114.3,,,1445000000.0,131000000.0,2246000000.0,2246000000.0,290000000.0,,,,,0.0,625000000.0,1.0,452000000.0,439000000.0,440000000.0,5.1160000000000005,7107000000.0,0.0,71000000.0,113000000.0,16.189,2915000000.0

I've seen two somewhat related questions on Stack Overflow, but they haven't helped me much. The first uses the csv library instead of pandas (which is an option for me). The second is more helpful since it uses pandas, but it hasn't gotten much interaction and isn't exactly my use case.

George Adams
  • You might need to convert the `marketcap` column to integer before filtering. Try something like this: `db = db.astype({'marketcap': int})` – Amine Messaoudi Nov 30 '20 at 15:36
  • I tried that and I get this error: ```ValueError: Cannot convert non-finite values (NA or inf) to integer``` – George Adams Nov 30 '20 at 15:42
  • try using `np.where` https://stackoverflow.com/q/53040174 – Jacob Myer Nov 30 '20 at 16:15
  • I'm confused as to how and where I would implement that in my code. ```db['marketcap'} = np.where(db['marketcap'] == ``` and then I'm not too sure what to make it equal to. In the question link that was sent, .where is attempting to change the value of cells, so how would I change them to integers so that I can work with them? – George Adams Nov 30 '20 at 16:25
  • @george.adams1 it's because there are rows with empty columns. You can remove them by calling `db.dropna(inplace=True)` after reading the dataframe – Amine Messaoudi Nov 30 '20 at 16:46
  • @AmineMessaoudi how do I make that specific to the 'marketcap' column? – George Adams Nov 30 '20 at 16:48
  • @george.adams1 Using this: `db.dropna(subset=['marketcap'], inplace=True)` – Amine Messaoudi Nov 30 '20 at 16:50
  • @AmineMessaoudi I tried that, my code was: ```db = db.dropna(subset=['marketcap'], inplace=True) db = db.astype({'marketcap':int})``` I got an error for the astype part: ```AttributeError: 'NoneType' object has no attribute 'astype'``` I looked up this link but it doesn't seem to help: https://stackoverflow.com/questions/57273464/attributeerror-nonetype-object-has-no-attribute-astype. – George Adams Nov 30 '20 at 17:06
  • `db.dropna(subset=['marketcap', inplace=True]` without assigning – Amine Messaoudi Nov 30 '20 at 20:30
  • @AmineMessaoudi I'm not entirely sure what that means. There is no second () in that line of code. Can you clarify where the [ ] and ( ) go? – George Adams Nov 30 '20 at 21:35
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/225338/discussion-between-george-adams1-and-amine-messaoudi). – George Adams Nov 30 '20 at 21:45
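Pulling the suggestions from this comment thread together, a minimal sketch of the question's script with those fixes folded in (file and column names are taken from the question; `errors='coerce'` is an assumption for handling entries that can't be parsed as numbers):

import pandas as pd

db = pd.read_csv('SF1_original.csv')

# Force marketcap to a numeric dtype; anything unparseable becomes NaN
db['marketcap'] = pd.to_numeric(db['marketcap'], errors='coerce')

# Drop rows with a missing marketcap; note dropna(inplace=True) returns None,
# so its result must not be assigned back to db
db.dropna(subset=['marketcap'], inplace=True)

# Keep only rows at or below 100 billion
db = db[db['marketcap'] <= 100000000000]

db.to_csv('new_SF1_original.csv')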

1 Answer


You can get the indexes of the rows with "marketcap" over 100 billion like so:

df.loc[df["marketcap"] > 100000000000]["marketcap"].index

All that's left to do is drop them from the DataFrame:

df.drop(df.loc[df["marketcap"] > 100000000000]["marketcap"].index, inplace=True)

Reading from the CSV and writing out the new CSV are already handled correctly in your code.
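Putting it together, a sketch of the full script (file names as in your question; `index=False` is optional, added here so the unnamed index column shown in your example output doesn't get written):

import pandas as pd

df = pd.read_csv('SF1_original.csv')

# Drop every row whose marketcap exceeds 100 billion
df.drop(df.loc[df["marketcap"] > 100000000000].index, inplace=True)

# index=False keeps the unnamed index column out of the new file
df.to_csv('new_SF1_original.csv', index=False)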

gmdev
  • The df that you wrote is supposed to be db, right? – George Adams Dec 16 '20 at 22:13
  • @george.adams1 yes – gmdev Dec 16 '20 at 23:00
  • @gmdev I keep getting a recurring issue of not being able to turn the cells in the file into integers, which I need in order to use the code you showed above to drop the lines over 100 billion in the CSV. ```x['marketcap'] = [['marketcap']] = db.apply(pd.to_numeric(db)) db = db.astype({'marketcap':int})``` I keep getting the error ```AttributeError: 'NoneType' object has no attribute 'astype'``` – George Adams Dec 16 '20 at 23:12
  • @george.adams1 do you have missing values in your CSV? – gmdev Dec 16 '20 at 23:21
  • I do, but I already used ```db = db.dropna(subset=['marketcap'], inplace=True)``` to get rid of them – George Adams Dec 17 '20 at 18:31
  • @george.adams1 what's the reason for `x = y = z` (`x['marketcap'] = [['marketcap']] = db.apply(pd.to_numeric(db))`)? – gmdev Dec 17 '20 at 21:04