
First, I loaded the data with:

import urllib.request
f = urllib.request.urlretrieve("https://www.dropbox.com/s/qz62t2oyllkl32s/kddcup.data_10_percent.gz?dl=1", "kddcup.data_10_percent.gz")


data_file = "./kddcup.data_10_percent.gz"
raw_data = sc.textFile(data_file)

Then I built an RDD containing only the numeric fields I need:

import numpy as np
import pandas as pd

def parse_interaction(line):
    line_split = line.split(",")
    # keep just numeric and logical values
    symbolic_indexes = [1,2,3,41]  # in the above sample would be: tcp,http,SF,normal
    clean_line_split = [item for i,item in enumerate(line_split) if i not in symbolic_indexes]
    return np.array([x for x in clean_line_split], dtype=float)

vector_data = raw_data.map(parse_interaction)

Now I can inspect the data with vector_data.take(2):

[array([0.00e+00, 1.81e+02, 5.45e+03, 0.00e+00, 0.00e+00, 0.00e+00,
        0.00e+00, 0.00e+00, 1.00e+00, 0.00e+00, 0.00e+00, 0.00e+00,
        0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00,
        0.00e+00, 8.00e+00, 8.00e+00, 0.00e+00, 0.00e+00, 0.00e+00,
        0.00e+00, 1.00e+00, 0.00e+00, 0.00e+00, 9.00e+00, 9.00e+00,
        1.00e+00, 0.00e+00, 1.10e-01, 0.00e+00, 0.00e+00, 0.00e+00,
        0.00e+00, 0.00e+00]),
 array([0.00e+00, 2.39e+02, 4.86e+02, 0.00e+00, 0.00e+00, 0.00e+00,
        0.00e+00, 0.00e+00, 1.00e+00, 0.00e+00, 0.00e+00, 0.00e+00,
        0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00,
        0.00e+00, 8.00e+00, 8.00e+00, 0.00e+00, 0.00e+00, 0.00e+00,
        0.00e+00, 1.00e+00, 0.00e+00, 0.00e+00, 1.90e+01, 1.90e+01,
        1.00e+00, 0.00e+00, 5.00e-02, 0.00e+00, 0.00e+00, 0.00e+00,
        0.00e+00, 0.00e+00])]

I want to convert it into a DataFrame with vector_data = pd.DataFrame(vector_data), but this does not work and raises the following error:

ValueError           Traceback (most recent call last)
<ipython-input-112-6a2dcc5bdb85> in <module>()
     10 
     11 vector_data = raw_data.map(parse_interaction)
---> 12 vector_data = pd.DataFrame(vector_data)

~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
    420                                          dtype=values.dtype, copy=False)
    421             else:
--> 422                 raise ValueError('DataFrame constructor not properly called!')
    423 
    424         NDFrame.__init__(self, mgr, fastpath=True)

ValueError: DataFrame constructor not properly called!

I know the input is in a special format and that I need to pass something extra to the DataFrame constructor for it to work. How can I build a DataFrame from this data?

BlueBit
  • What is `raw_data.map`? – roganjosh Sep 26 '18 at 20:10
  • @roganjosh, I just added all the codes – BlueBit Sep 26 '18 at 20:15
  • `pd.DataFrame({'vector1': vector_data[0], 'vector2': vector_data[1]})` – Khalil Al Hooti Sep 26 '18 at 20:25
  • which one do you want to have, dataframe with 2 columns or 38 columns? – ipramusinto Sep 26 '18 at 20:28
  • @bakka with 38 columns. – BlueBit Sep 26 '18 at 20:41
  • `vector_data` is an `rdd` right? And you want a pandas DataFrame? Or do you want a spark DataFrame? – pault Sep 26 '18 at 20:55
  • @pault, I want to create a DataFrame from my data; it does not matter whether it is pandas or PySpark. I first have to extract the required data from the main dataset, then convert it to a DataFrame for MLlib. – BlueBit Sep 26 '18 at 20:58
  • In that case I think you are looking for: [Spark RDD to DataFrame python](https://stackoverflow.com/questions/39699107/spark-rdd-to-dataframe-python). As an aside, it very much **does** matter if it's Pandas or PySpark. The two are very different, and though it is possible to convert between the two, you're going to run into a lot of slow code if you try to use them interchangeably. – pault Sep 26 '18 at 20:59

1 Answer


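You can use from_records(). It expects a local sequence of arrays rather than an RDD, so if vector_data is still an RDD, bring the rows to the driver first (for example with vector_data.collect() or vector_data.take(n)):

vector_data = [np.array(...), np.array(...)]
pd.DataFrame.from_records(vector_data)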
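
For illustration, here is a minimal sketch of that approach that runs without Spark. The two sample lines are made up to match the 42-field layout described in the question, and the local list `rows` is only a stand-in for what `vector_data.collect()` would return on the driver; neither comes from the real dataset.

```python
import numpy as np
import pandas as pd

# Positions of the symbolic fields, as in the question.
SYMBOLIC_INDEXES = {1, 2, 3, 41}

def parse_interaction(line):
    """Drop the symbolic fields and return the rest as a float array."""
    fields = line.split(",")
    numeric = [v for i, v in enumerate(fields) if i not in SYMBOLIC_INDEXES]
    return np.array(numeric, dtype=float)

# Made-up sample lines with 42 comma-separated fields each (assumption:
# they only mimic the shape of the KDD Cup records, not real values).
sample_lines = [
    ",".join(["0", "tcp", "http", "SF", "181", "5450"] + ["0"] * 35 + ["normal."]),
    ",".join(["0", "tcp", "http", "SF", "239", "486"] + ["0"] * 35 + ["normal."]),
]

# Stand-in for vector_data.collect(): a plain Python list of NumPy arrays.
rows = [parse_interaction(line) for line in sample_lines]

# from_records turns each array into one row of the DataFrame.
df = pd.DataFrame.from_records(rows)
print(df.shape)
```

Collecting pulls every row into the driver's memory, so this only makes sense when the data fits on one machine. If it does not, a Spark DataFrame is probably the better target.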
Andy