
So, we have messy data stored in a TSV file that I need to analyse. This is how it looks:

status=200  protocol=http   region_name=Podolsk datetime=2016-03-10 15:51:58    user_ip=0.120.81.243    user_agent=Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36    user_id=7885299833141807155 user_vhost=tindex.ru    method=GET  page=/search/

The problem is that some of the rows have a different column order and some are missing values, and I need to clean this up with high performance (the datasets I am working with are up to 100 gigabytes).

import pandas as pd

Data = pd.read_table('data/data.tsv', sep='\t+', header=None,
                     names=['status', 'protocol', 'region_name', 'datetime',
                            'user_ip', 'user_agent', 'user_id', 'user_vhost',
                            'method', 'page'],
                     engine='python')
Clean_Data = Data.dropna().reset_index(drop=True)

Now I have got rid of the missing values, but one problem still remains. Two screenshots in the original post (omitted here) showed the data and the problem: some of the columns are offset, so values end up under the wrong column headers. I made a very low-performance solution:

ids = Clean_Data.index.tolist()
for column in Clean_Data.columns:
    for row, i in zip(Clean_Data[column], ids):
        # if the cell does not contain its own column name, the row is offset
        if column not in row:
            Clean_Data.drop([i], inplace=True)
            ids.remove(i)

So now the data looks good, or at least I can work with it. But what is a high-performance alternative to the method I made above?

Update on unutbu's code: it raises the following traceback:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-52c9d76f9744> in <module>()
      8     df.index.names = ['index', 'num']
      9 
---> 10     df = df.set_index('field', append=True)
     11     df.index = df.index.droplevel(level='num')
     12     df = df['value'].unstack(level=1)

/Users/Peter/anaconda/lib/python2.7/site-packages/pandas/core/frame.pyc in set_index(self, keys, drop, append, inplace, verify_integrity)
   2805             if isinstance(self.index, MultiIndex):
   2806                 for i in range(self.index.nlevels):
-> 2807                     arrays.append(self.index.get_level_values(i))
   2808             else:
   2809                 arrays.append(self.index)

/Users/Peter/anaconda/lib/python2.7/site-packages/pandas/indexes/multi.pyc in get_level_values(self, level)
    664         values = _simple_new(filled, self.names[num],
    665                              freq=getattr(unique, 'freq', None),
--> 666                              tz=getattr(unique, 'tz', None))
    667         return values
    668 

/Users/Peter/anaconda/lib/python2.7/site-packages/pandas/indexes/range.pyc in _simple_new(cls, start, stop, step, name, dtype, **kwargs)
    124                 return RangeIndex(start, stop, step, name=name, **kwargs)
    125             except TypeError:
--> 126                 return Index(start, stop, step, name=name, **kwargs)
    127 
    128         result._start = start

/Users/Peter/anaconda/lib/python2.7/site-packages/pandas/indexes/base.pyc in __new__(cls, data, dtype, copy, name, fastpath, tupleize_cols, **kwargs)
    212             if issubclass(data.dtype.type, np.integer):
    213                 from .numeric import Int64Index
--> 214                 return Int64Index(data, copy=copy, dtype=dtype, name=name)
    215             elif issubclass(data.dtype.type, np.floating):
    216                 from .numeric import Float64Index

/Users/Peter/anaconda/lib/python2.7/site-packages/pandas/indexes/numeric.pyc in __new__(cls, data, dtype, copy, name, fastpath, **kwargs)
    105             # with a platform int
    106             if (dtype is None or
--> 107                     not issubclass(np.dtype(dtype).type, np.integer)):
    108                 dtype = np.int64
    109 

TypeError: data type "index" not understood

Pandas version: 0.18.0-np110py27_0

Update

Everything worked... Thanks everybody!

Petr Fedosov

3 Answers


Suppose you had TSV data such as this:

status=A    protocol=B  region_name=C   datetime=D  user_ip=E   user_agent=F    user_id=G
user_id=G   status=A    region_name=C   user_ip=E   datetime=D  user_agent=F    protocol=B
protocol=B      datetime=D  status=A    user_ip=E   user_agent=F    user_id=G

The order of the fields may be scrambled, and there may be missing values. However, you don't have to drop rows just because the fields don't appear in a certain order. You can use the field names provided in the row data itself to place the values in the correct columns. For example,

import pandas as pd

df = pd.read_table('data/data.tsv', sep='\t+', header=None, engine='python')

# coalesce all cells into a single column, then split 'field=value'
# into separate 'field' and 'value' columns
df = df.stack().str.extract(r'([^=]*)=(.*)', expand=True).dropna(axis=0)
df.columns = ['field', 'value']

# move 'field' into the index, drop the per-row position level,
# then pivot 'field' out into the column index
df = df.set_index('field', append=True)
df.index = df.index.droplevel(level=1)
df = df['value'].unstack(level=1)

print(df)

yields

field datetime protocol region_name status user_agent user_id user_ip
index                                                                
0            D        B           C      A          F       G       E
1            D        B           C      A          F       G       E
2            D        B        None      A          F       G       E

To handle a large TSV file, you could process rows in chunks, and then concatenate the processed chunks into one DataFrame at the end:

import pandas as pd

chunksize = 10 ** 5  # the number of rows to be processed per iteration; this value is just an example, tune it to your memory
dfs = []
reader = pd.read_table('data/data.tsv', sep='\t+',header=None, engine='python',
                       iterator=True, chunksize=chunksize)
for df in reader:
    df = df.stack().str.extract(r'([^=]*)=(.*)', expand=True).dropna(axis=0)
    df.columns = ['field', 'value']

    # avoid reassigning df.index.names here; doing so triggered the
    # TypeError reported above in pandas 0.18.0
    # (see https://github.com/pydata/pandas/issues/13000)
    df = df.set_index('field', append=True)
    df.index = df.index.droplevel(level=1)   # drop the per-row position level
    df = df['value'].unstack(level=1)
    dfs.append(df)

df = pd.concat(dfs, ignore_index=True)
print(df)
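
If even the concatenated result is too big to hold in memory, the chunks can instead be appended to an output file as they are processed (as discussed in the comments below). This is a sketch under assumptions: the chunk size and the output path 'data/clean.csv' are examples, not values from the original answer:

import pandas as pd

chunksize = 10 ** 5  # example value; tune to your memory
columns = ['status', 'protocol', 'region_name', 'datetime', 'user_ip',
           'user_agent', 'user_id', 'user_vhost', 'method', 'page']

reader = pd.read_table('data/data.tsv', sep='\t+', header=None, engine='python',
                       iterator=True, chunksize=chunksize)
out_path = 'data/clean.csv'  # hypothetical output file

for i, df in enumerate(reader):
    df = df.stack().str.extract(r'([^=]*)=(.*)', expand=True).dropna(axis=0)
    df.columns = ['field', 'value']
    df = df.set_index('field', append=True)
    df.index = df.index.droplevel(level=1)
    # reindex to a fixed column list so every chunk has the same layout
    df = df['value'].unstack(level=1).reindex(columns=columns)
    # write the header only for the first chunk, then append
    df.to_csv(out_path, mode='w' if i == 0 else 'a', header=(i == 0), index=False)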

Explanation: Given df:

In [527]: df = pd.DataFrame({0: ['status=A', 'user_id=G', 'protocol=B'],
 1: ['protocol=B', 'status=A', 'datetime=D'],
 2: ['region_name=C', 'region_name=C', 'status=A'],
 3: ['datetime=D', 'user_ip=E', 'user_ip=E'],
 4: ['user_ip=E', 'datetime=D', 'user_agent=F'],
 5: ['user_agent=F', 'user_agent=F', 'user_id=G'],
 6: ['user_id=G', 'protocol=B', None]}); df
Out[527]: 
            0           1              2           3             4             5           6
0    status=A  protocol=B  region_name=C  datetime=D     user_ip=E  user_agent=F   user_id=G
1   user_id=G    status=A  region_name=C   user_ip=E    datetime=D  user_agent=F  protocol=B
2  protocol=B  datetime=D       status=A   user_ip=E  user_agent=F     user_id=G        None

you can coalesce all the values into a single column:

In [449]: df.stack()
Out[449]: 
0  0         status=A
   1       protocol=B
   2    region_name=C
   3       datetime=D
   4        user_ip=E
   5     user_agent=F
   6        user_id=G
1  0        user_id=G
   1         status=A
   2    region_name=C
   3        user_ip=E
   4       datetime=D
   5     user_agent=F
   6       protocol=B
2  0       protocol=B
   1       datetime=D
   2         status=A
   3        user_ip=E
   4     user_agent=F
   5        user_id=G
dtype: object

and then apply .str.extract(r'([^=]*)=(.*)') to separate the field name from the value:

In [450]: df = df.stack().str.extract(r'([^=]*)=(.*)', expand=True).dropna(axis=0); df
Out[450]: 
               0  1
0 0       status  A
  1     protocol  B
  2  region_name  C
  3     datetime  D
  4      user_ip  E
  5   user_agent  F
  6      user_id  G
1 0      user_id  G
  1       status  A
  2  region_name  C
  3      user_ip  E
  4     datetime  D
  5   user_agent  F
  6     protocol  B
2 0     protocol  B
  1     datetime  D
  2       status  A
  3      user_ip  E
  4   user_agent  F
  5      user_id  G

To make it easier to reference parts of the DataFrame, let's give the columns and index levels descriptive names:

In [530]: df.columns = ['field', 'value']; df.index.names = ['index', 'num']; df
Out[530]: 
                 field value
index num                   
0     0         status     A
      1       protocol     B
...

Now if we move the field column into the index:

In [531]: df = df.set_index('field', append=True); df
Out[531]: 
                      value
index num field            
0     0   status          A
      1   protocol        B
      2   region_name     C
      3   datetime        D
...

and drop the num index level:

In [532]: df.index = df.index.droplevel(level='num'); df
Out[532]: 
                  value
index field            
0     status          A
      protocol        B
      region_name     C
      datetime        D
... 

then we can obtain a DataFrame of the desired form by moving the field index level into the column index:

In [533]: df = df['value'].unstack(level=1); df
Out[533]: 
field datetime protocol region_name status user_agent user_id user_ip
index                                                                
0            D        B           C      A          F       G       E
1            D        B           C      A          F       G       E
2            D        B        None      A          F       G       E
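
For reference, the same steps can be chained into a single expression starting from the raw df above (a condensed sketch of the walkthrough, not a separate method from the original answer):

out = (df.stack()
         .str.extract(r'([^=]*)=(.*)', expand=True)
         .dropna(axis=0)
         .rename(columns={0: 'field', 1: 'value'})
         .reset_index(level=1, drop=True)       # drop the per-row position level
         .set_index('field', append=True)['value']
         .unstack(level=1))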
unutbu
  • Could you please comment your code so I (and everybody) can follow you completely? Hm it seems that iPython works really slow with that code. Actually it got stuck at `df = df['all'].str.extract('\t'.join(['(.*)']*(n+1)), expand=True)` row... any ideas? – Petr Fedosov May 07 '16 at 00:51
  • I've changed the code (so it no longer uses that line) and added a section on how to read a TSV file in chunks. If the chunksize is not too large you should be able to see results more quickly. Moreover, perhaps you do not need to form one huge DataFrame; perhaps you can process the TSV in chunks iteratively. – unutbu May 07 '16 at 02:08
  • `df = df.set_index('field', append=True)` throws `TypeError: data type "index" not understood`... You can try out my [Data file (just 30Mb)](https://drive.google.com/open?id=0BxWmwAIo1D_nUUpqTU5TdDlaUHc)... Actually I have no clue why. Seems to be a pandas bug – Petr Fedosov May 07 '16 at 23:59
  • I downloaded `data.tsv`, ran my code on it but was unable to reproduce the `TypeError`. I'm using pandas version 0.18.0. Please post (in your question, not the comments section) the full traceback error message. It will tell us useful things like the exact line on which the `TypeError` was raised. – unutbu May 08 '16 at 00:33
  • The problem may be related to https://github.com/pydata/pandas/issues/13000. I've modified the code so as to avoid reassigning `df.index.names`. Perhaps see if that avoids the `TypeError`. – unutbu May 08 '16 at 01:53
  • Yeah I noticed that thread on github too. Now the code works fine. Thank you. I mean, really. Never expected that level of help. – Petr Fedosov May 08 '16 at 02:05

You can use Pandas' vectorized string operations, specifically str.contains:

import numpy as np

# boolean index of rows to keep
is_valid = np.ones(Clean_Data.shape[0], dtype=bool)

for column in Clean_Data.columns:
    # check whether each row's cell contains this column's name
    is_valid &= Clean_Data[column].str.contains(column)

# drop rows where not all elements contain their respective column names
Clean_Data.drop(np.where(~is_valid)[0], inplace=True)
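
Note that str.contains(column) matches the column name anywhere in the cell. To require the name as the field prefix, an anchored pattern should work (a small variation, not from the original answer):

# stricter check: the cell must start with 'column_name='
is_valid &= Clean_Data[column].str.contains('^' + column + '=')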
ali_m
  • The line `is_valid &= Clean_Data[column].str.contains(column)` raises `ValueError: operands could not be broadcast together with shapes (10,) (75618,) (10,)` – Petr Fedosov May 08 '16 at 00:04
  • Sorry, there was a typo in the initialization of `is_valid`. Try now. – ali_m May 08 '16 at 00:27

I can't add comments, so I'll post this as an answer (in reality it is a comment, in response to your comment about memory usage and runtime).

One thing that you need to consider, for large files (100 GB), is that you are not going to read them into memory all at once. You can set a chunk size for pandas (see “Large data” work flows using pandas or How to read a 6 GB csv file with pandas), or read the file row by row with a generator and the csv module (see Reading a huge .csv in python).

Incorporating @unutbu's comment about using a regex to sort the entries into columns: given that the field names are so clearly demarcated in each cell, a pattern like r'(.*)=(.*)' is all that is required (although some error correction may be needed). Combined with chunked or line-by-line reading, that should be all you need. Also, as they say, dropping entire rows because of some missing data is not a typical, or recommended, approach.
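
To make the suggestion concrete, here is a minimal sketch of such a line-by-line parser; the field list and the output path 'data/clean.tsv' are assumptions for illustration, not from the original answers:

import csv
import re

FIELDS = ['status', 'protocol', 'region_name', 'datetime', 'user_ip',
          'user_agent', 'user_id', 'user_vhost', 'method', 'page']
PAIR = re.compile(r'([^=]*)=(.*)')

with open('data/data.tsv') as src, open('data/clean.tsv', 'w') as dst:
    writer = csv.writer(dst, delimiter='\t')
    writer.writerow(FIELDS)
    for line in src:
        # parse each cell as 'field=value', regardless of column order
        record = {}
        for cell in line.rstrip('\n').split('\t'):
            m = PAIR.match(cell)
            if m:
                record[m.group(1)] = m.group(2)
        # write values in a fixed column order; missing fields become ''
        writer.writerow([record.get(f, '') for f in FIELDS])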

myles