
I have seen some posts about this before, but I can't find a solution that solves my problem.

I have an array of 1743 elements that is loaded from a txt file. Each element is a string such as '1,1232,3,2018-03-24', where the structure is movieID,customerID,rating,date.

I am rather new to Python, but I know that I want this as a dataframe whose column names follow the structure of the string.

I am having trouble converting the array into a dataframe. I thought about writing the array elements to a file and then loading that file into a dataframe, but I am aware that this would be very time-consuming, and the full dataset in the file has over 24 million rows.

UPDATE:

I have now been able to split each string into 4 elements: movieID, customerID, rating, date.

The only remaining problem is putting them into the dataframe correctly. Below is all the code I have, followed by the error it produces.

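import pandas as pd

# `file` holds the lines loaded from the txt file
movieID = ''
names = ['customerID', 'movieID', 'rating']
data = pd.DataFrame(columns=names)
tf = False


for line in file:
    tf = False
    line = line.strip('\n')
    # a line ending in ':' starts a new movie block, e.g. '1:'
    if line[len(line) - 1] == ':':
        movieID = line.strip(':')
        tf = True

    # every other line is 'customerID,rating,date' for the current movie
    if tf != True:
        line = movieID + ',' + line
        text = line.split(',')
        # this is the line that raises the ValueError
        df = pd.DataFrame([text[1], text[0], text[2]], columns=names)
        data = data.append(df)
        tf = False


print(data)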




ValueError: Shape of passed values is (1, 3), indices imply (3, 3) at df = pd.DataFrame([text[1], text[0], text[2]], columns = names)
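
For reference, the shape mismatch can be reproduced in isolation. The following is only a minimal sketch, assuming the sample string from above; the names `flat_list_failed` and `row` are made up for illustration. pandas reads a flat list of three values as three rows of one column each, which cannot match three column names, whereas a nested list is read as one row of three columns.

```python
import pandas as pd

names = ['customerID', 'movieID', 'rating']
text = '1,1232,3,2018-03-24'.split(',')  # sample element from the question

# A flat list is interpreted as three rows with a single column,
# so passing three column names raises a ValueError.
try:
    pd.DataFrame([text[1], text[0], text[2]], columns=names)
    flat_list_failed = False
except ValueError:
    flat_list_failed = True

# Wrapping the values in an outer list makes them one row with three columns.
row = pd.DataFrame([[text[1], text[0], text[2]]], columns=names)
print(row)
```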
  • Does this answer your question, maybe? https://stackoverflow.com/questions/21546739/load-data-from-txt-with-pandas – FatihAkici Apr 06 '18 at 14:53
  • Not really. When I load the data from the file, it does not have the structure that is in the array. The file is structured like this: 1: 2131,2018-02-24\n4512,2017-02-25, where '1:' indicates the movie and the lines that follow are each customer's rating – Anton Gustafsson Drevinum Apr 06 '18 at 14:56
  • If you can't read directly from the file, you may be able to use something like [`DataFrame.from_records`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.from_records.html). – 0x5453 Apr 06 '18 at 14:57
  • Try a solution from one of the many similar questions on SO. In a few hours, if you are still stuck, please come back and we're here to help. I recommend you post some code, a sample (the first few rows of data), and the error you get. – jpp Apr 06 '18 at 14:58
  • I will try DataFrame.from_records, it seems to be exactly the function I was looking for – Anton Gustafsson Drevinum Apr 06 '18 at 15:02
  • The issue is that you have multiple delimiters (`:` and `,`). You can try the approach from [this answer](https://stackoverflow.com/a/26551913/5858851). Something like: `pd.read_csv('filename.txt', names=['movieID','customerID','rating','date'], sep=',|:', engine='python')` – pault Apr 06 '18 at 15:02
