
I am preparing data for regression but I couldn't get it to work. I have to convert two columns, Likes and Popularity, to integers. How can I do it?

Unique_ID      int64
Genre          int64
Views          int64
Comments       int64
Likes         object
Popularity    object
Followers      int64
dtype: object

  1. I did this:

df['Popularity']=df.Popularity.str.replace(',','').astype(int)

and got this error:

invalid literal for int() with base 10: '13.1K'

  2. Then I tried this:
pd.to_numeric(df['Likes'], downcast='integer')

and again got an error:

Unable to parse string "2,400" at position 3

  3. And this as well:
df = df.astype(int)

invalid literal for int() with base 10: '2,400'

What can I do so that I can run a regression on my data?

M_S_N
Himanshu
  • `df[column] = df[column].astype(int)` – talatccan Jan 11 '20 at 06:43
  • Welcome to StackOverflow. Please take the time to read this post on [how to provide a great pandas example](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) as well as how to provide a [minimal, complete, and verifiable example](http://stackoverflow.com/help/mcve) and revise your question accordingly. These tips on [how to ask a good question](http://stackoverflow.com/help/how-to-ask) may also be useful. – jezrael Jan 11 '20 at 06:43
  • could you post a sample of data columns you are trying to convert? – Subhashi Jan 11 '20 at 06:56
  • you can checkout the data from here https://www.machinehack.com/course/chartbusters-prediction-foretell-the-popularity-of-songs/ I was working on the training set. – Himanshu Jan 11 '20 at 12:46
  • @subhashi - I want to retain the information of K and M in my data , i dont want to remove them. I want to make a function which will automatically convert K and M in the respective positions with 1000 and 1000000. – Himanshu Jan 11 '20 at 13:18

1 Answer


Some entries may be in a format like 13.1K, so you have to strip the trailing K as well as the commas.

df['Property'] = df['Property'].str.replace(',','')
df['Property'] = df['Property'].str.rstrip('K')

If there are other suffix characters as well, such as M, strip them too or use a regex to find them, and then convert the column to float.

df['Property'] = df['Property'].astype('float64')
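
As a quick check of the steps above, here is a minimal sketch on a made-up Series (the sample values and the names `s`, `cleaned`, and `as_float` are assumptions, not taken from the OP's data). Note that stripping the suffix drops the scale, so `13.1K` ends up as `13.1` rather than `13100`.

```python
import pandas as pd

# Hypothetical sample values; not the OP's actual data
s = pd.Series(['13.1K', '2,400', '4555'])

# Remove thousands separators, then strip a trailing 'K'
cleaned = s.str.replace(',', '', regex=False).str.rstrip('K')

# Convert to float; the magnitude implied by 'K' is lost at this point
as_float = cleaned.astype('float64')
print(as_float.tolist())  # [13.1, 2400.0, 4555.0]
```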

You can also strip any trailing letters as follows:

from string import ascii_letters
df['Property'] = df['Property'].str.rstrip(ascii_letters)

EDIT

As per the OP's requirement stated in the comments, the solution below will work.

Assuming the original dataset has values like this:

0   13.1K
1   2,400
2   4555
3   6,1M
4   6.1M

Using the following code:

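# Remove thousands separators first
df['Property'] = df['Property'].str.replace(',', '')
# Numeric part (suffix removed) times the multiplier implied by the suffix;
# rows without a K/M suffix get a multiplier of 1
df['Property'] = (df['Property'].replace(r'[KM]+$', '', regex=True).astype(float) *
                  df['Property'].str.extract(r'[\d\.]+([KM]+)', expand=False)
                  .map({'K': 10**3, 'M': 10**6}).fillna(1).astype(int))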

will transform the data as follows:

0   13100.0
1   2400.0
2   4555.0
3   61000000.0
4   6100000.0
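
Since the OP asked for a function, the same idea can also be sketched as a plain Python helper applied with `Series.map`. This is only an illustration under some assumptions: `parse_count` and `SUFFIXES` are hypothetical names, only the `K` and `M` suffixes are handled, and lowercase `k`/`m` are treated the same way.

```python
import pandas as pd

# Multipliers for the suffixes seen in the question (assumed to be only K and M)
SUFFIXES = {'K': 10**3, 'M': 10**6}

def parse_count(text):
    """Convert strings such as '13.1K' or '2,400' to a float (hypothetical helper)."""
    # Drop thousands separators and normalise case so 'k' and 'K' match
    text = str(text).replace(',', '').strip().upper()
    if text and text[-1] in SUFFIXES:
        # Scale the numeric part by the multiplier for its suffix
        return float(text[:-1]) * SUFFIXES[text[-1]]
    return float(text)

# Made-up sample frame mirroring the values shown above
df = pd.DataFrame({'Property': ['13.1K', '2,400', '4555', '6.1M']})
df['Property'] = df['Property'].map(parse_count)
print(df)
```

A row-wise helper like this is slower than the vectorised version on large columns, but it is easier to extend, for example with a `B` suffix, by adding an entry to the dictionary.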
M_S_N
  • I don't want to strip them. I want to put 1000 where there is K and 1000000 where there is M across all my data. Thank you – Himanshu Jan 11 '20 at 12:42
  • `MemoryError: Unable to allocate array with shape (54920, 40658) and data type int64`. Then I tried `echo 1 > /proc/sys/vm/overcommit_memory` and got `zsh: permission denied: /proc/sys/vm/overcommit_memory` – Himanshu Jan 11 '20 at 18:51
  • `MemoryError` is out of scope of this question, but you should use `sudo` or log in as `root` and that may do the trick. A few days back I had exactly this problem and these SO posts helped me: 1) https://stackoverflow.com/questions/57507832/unable-to-allocate-array-with-shape-and-data-type 2) https://stackoverflow.com/questions/57812453/memoryerror-unable-to-allocate-array-with-shape-and-data-type-object – M_S_N Jan 11 '20 at 20:04
  • I tried the sudo method but it didn't work. I tried `sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"` but still couldn't do anything. The same error is coming up – Himanshu Jan 12 '20 at 04:23
  • It depends on your RAM. I think that after overcommit the figure in the error, `shape (54920, 40658)`, would have gone down, but your memory is still unable to handle such a large dataset. You can try reducing the size of your dataset. – M_S_N Jan 12 '20 at 05:00
  • @Himanshu, consider green ticking the solution so this can be closed =) – M_S_N Jan 12 '20 at 05:18