I have imported a large dataset with read_csv, as shown below. It should contain only float measurements and NaN.

df = pd.read_csv(file_,parse_dates=[['Date','Time']],na_values = ['No Data','Bad Data','','No Sample'],low_memory=False)    

When I check df.dtypes, most of the columns come back as object, which indicates that the dataframe contains other values that I am not aware of. I am looking for a way to identify those strings and replace them with NaN values.

The first thing I wanted to do was convert everything to dtype = np.float, but I couldn't. Then I tried to read each (index, column) cell and collect the strings I found.

I have tried something very inefficient and time-consuming (I am a beginner). It has worked for other dataframes, but here it returns an error:

TypeError: argument of type 'float' is not iterable

from isstring import *

list_string = []
for i in range(0, len(df)):
    for j in range(0, len(df.columns)):
        x = df.ix[i, j]  # cell at row i, column j
        # keep strings that do not look like decimal numbers
        if isstring(x) and '.' not in x:
            list_string.append(x)

list_string = pd.DataFrame(list_string, columns=["list_string"])
g = list_string.groupby('list_string').size()

Is there a simple way of detecting unknown strings in a large dataset? Thanks.

Stefan
IngridM

1 Answer

You could try:

string_list = []
for col, series in df.items(): # iterating over all columns - perhaps only select `object` types
    string_list += [s for s in series.unique() if isinstance(s, str)]
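If the end goal is to replace those strings with NaN, one possible follow-up is `pd.to_numeric` with `errors="coerce"`. This is a different technique from the loop above, not part of the original answer: it converts whatever parses as a number and turns the rest into NaN. The sketch below is a minimal example under stated assumptions; the sample frame, its column names, and the placeholder strings `"Calib"` and `"Offline"` are hypothetical stand-ins for the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the imported CSV.
df = pd.DataFrame({
    "a": [1.5, "Calib", 2.0, np.nan],
    "b": [3.1, "4.2", "Offline", 5.0],
    "c": [1.0, 2.0, 3.0, 4.0],
})

unknown = set()  # distinct values that could not be parsed as numbers
for col in list(df.columns):
    series = df[col]
    if pd.api.types.is_numeric_dtype(series):
        continue  # already numeric, nothing to clean
    # Anything that cannot be parsed as a number becomes NaN.
    converted = pd.to_numeric(series, errors="coerce")
    # Values that were present before conversion but are NaN afterwards.
    bad = series[converted.isna() & series.notna()]
    unknown.update(bad.unique())
    df[col] = converted

print(sorted(unknown))
print(df.dtypes)
```

Collecting `unknown` before overwriting each column gives a list that can be reviewed and, if appropriate, added to `na_values` in the `read_csv` call.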
Stefan
  • It has done the job. Thanks a lot. A question: is += the same as append? And what exactly does the series next to col mean? Am I right in saying that df.items() returns all columns in df as series? Thanks Stefan. – IngridM May 11 '16 at 10:06
  • You're welcome. The `+=` is more like `.extend()` - see http://stackoverflow.com/questions/252703/python-append-vs-extend, also in the comments on `+=` vs `.extend()`. The `series` will contain the `column` values resulting from `.items()`, which is a `python3` shorthand for `.iteritems()` http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.iteritems.html. – Stefan May 11 '16 at 14:36
  • Did this answer your question or do you need any additional info? – Stefan May 12 '16 at 16:16
  • Yes Stefan, it has. I thought I had to minimize the comments. It is my first post, so if I have to do something to give you points, please tell me. Thanks a lot – IngridM May 12 '16 at 17:29
  • No worries, just accept the answer if you think that's appropriate and upvote if you feel like it :) – Stefan May 12 '16 at 17:41
  • Ah ok, sorry about that! I think I am fine now, am I? – IngridM May 12 '16 at 18:40
  • Well done. The point of accepting answers is mostly to signal a question has been closed so it doesn't keep popping up as 'unanswered'. – Stefan May 12 '16 at 21:11
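
The two points raised in the comments, `+=` versus `append` and what `df.items()` yields, can be illustrated with a small sketch. The lists and the two-column frame here are made up for the illustration and are not taken from the original data.

```python
import pandas as pd

a = [1, 2]
a.append([3, 4])   # append adds the whole list as a single element

b = [1, 2]
b.extend([3, 4])   # extend adds each element of the iterable

c = [1, 2]
c += [3, 4]        # += on a list behaves like extend

# Hypothetical frame: items() yields (column name, Series) pairs.
df = pd.DataFrame({"x": [1.0, "bad"], "y": [2.0, 3.0]})
pairs = [(name, type(col).__name__) for name, col in df.items()]

print(a, b, c)
print(pairs)
```

This is why the answer can write `for col, series in df.items()`: each iteration unpacks one column label and the matching column as a Series.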