
I have a 2-column dataframe with the following values. Unfortunately, since the list of values is long, it gets chopped with `, ...]` when saving to a csv with to_csv. How can I retain the entire list while saving?

    attribute_name  list_of_values
    attribute1      [1320698, 1320699, 1096323, 1320690, 1839190, 1091359, 1325750, 1569072, 1829679, 142100, 1320163, 1829673, 588914, 418137, 757085, 588910, 1321158, 1073897, 1823533, 1823535, 1091363, 1383908, 1834826, 36191, 1829641, 767536, 1829597, 1829591, 1326727, 1834700, 1317721, 1317802, 1834838, 52799, 1383915, 1320042, 1829654, 1829655, 1829658, 647089, 1829581, 1829586, 1829587, 1321116, 1829585, 1829588, 1839799, 1588509, 1834471, 1793632, 1327850, 1793599, 1456968, 1315869, 1793605, 1321236, 1829579, 1829577, 1793609, 1829571, 1829570, 1320139, 777057, 1829671, 1829566, 1831047, 1829567, 588927, 60484, 1793596, 1829634, 1839580, 1829569, 1793615, 1323529, 1793619, 1834758, 1612974, 1320007, 1839780, 1291475, 1834835, 1834453, 1823663, 418112, 1092106, 1829689, 1829688, 1793606, 647050, 1834742, 1839551, 1839553, 1834746, 1839556, 1834745, 1575978, 1834749, 1320711, 1317910, ...]

I save with df.to_csv(loc, index=False, header=False, sep='\t', mode='a', encoding='utf8'). I tried the display options here, http://pandas.pydata.org/pandas-docs/dev/options.html, with pd.set_option('max_colwidth', 20000000000), but since those options only affect display mode, not dumping to a csv, this does not work.

What else can I set to retain the contents of the entire list?
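One workaround that does stay with to_csv (a sketch with a hypothetical short list standing in for the real 155 IDs): serialise the list column into a single delimited string before writing, so to_csv only ever sees a plain string and no repr-based truncation can occur.

```python
import pandas as pd

# Hypothetical stand-in for the real list of 155 IDs
values = ['1320698', '1320699', '1096323']
df = pd.DataFrame({'attribute_name': ['attribute1'],
                   'list_of_values': [values]})

# Join the list into one delimited string before saving
df['list_of_values'] = df['list_of_values'].apply(','.join)
df.to_csv('/tmp/out.csv', index=False, header=False, sep='\t')

print(open('/tmp/out.csv').read())
# attribute1	1320698,1320699,1096323
```

On read-back the cell is a plain string, so `.split(',')` recovers the list without any eval step.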

Edit: Try creating a dataframe with this original data; once you save it, it will give you the distorted data as shown above.

    import pandas as pd
    pd.options.display.multi_sparse = False
    pd.set_option('max_colwidth', 2000000000000000000)
    headers = ["attribute_name", "list_of_values"]
    file_name = '/home/ekta/abcd.csv'
    data = ['attribute1', ['1320698', '1320699', '1096323', '1320690', '1839190', '1091359', '1325750', '1569072', '1829679', '142100', '1320163', '1829673', '588914', '418137', '757085', '588910', '1321158', '1073897', '1823533', '1823535', '1091363', '1383908', '1834826', '36191', '1829641', '767536', '1829597', '1829591', '1326727', '1834700', '1317721', '1317802', '1834838', '52799', '1383915', '1320042', '1829654', '1829655', '1829658', '647089', '1829581', '1829586', '1829587', '1321116', '1829585', '1829588', '1839799', '1588509', '1834471', '1793632', '1327850', '1793599', '1456968', '1315869', '1793605', '1321236', '1829579', '1829577', '1793609', '1829571', '1829570', '1320139', '777057', '1829671', '1829566', '1831047', '1829567', '588927', '60484', '1793596', '1829634', '1839580', '1829569', '1793615', '1323529', '1793619', '1834758', '1612974', '1320007', '1839780', '1291475', '1834835', '1834453', '1823663', '418112', '1092106', '1829689', '1829688', '1793606', '647050', '1834742', '1839551', '1839553', '1834746', '1839556', '1834745', '1575978', '1834749', '1320711', '1317910', '1829700', '1839791', '1839796', '1320019', '1829494', '437131', '1829696', '1839576', '721318', '1829699', '1838874', '1315822', '647049', '1325775', '1320708', '133913', '835588', '1839564', '1320700', '1320707', '1839563', '1834737', '1834736', '1834734', '1823669', '1321159', '1320577', '1839768', '1823665', '1838602', '1823667', '1321099', '1753590', '1753593', '1320688', '1839583', '1326633', '1320681', '1793646', '1323683', '1091348', '982081', '1793648', '1478516', '1317650', '1829663', '1829667', '1829666', '1793640', '1839577', '1315855', '1317796', '1839775', '1321163', '1793642']]


    def write_file(data, flag, headers, file_name):
        # open a df & write recursively
        print " \n \n data", data
        df = pd.DataFrame(data).T
        print "df \n", df
        # write to a csv recursively
        loc = file_name
        #loc = "%s%s_%s" % (args.base_path, args.merchant_domain, file_name)
        if flag == True:
            df.to_csv(loc, index=False, header=headers, sep='\t', mode='a', encoding='utf8')
            flag = False
        elif flag == False:
            df.to_csv(loc, index=False, header=False, sep='\t', mode='a', encoding='utf8')
        return loc

    # I call the function above with this data & headers; I pass flag as True the
    # 1st time around, after which I write recursively with flag=False.
    write_file(data, True, headers, file_name)

Debug: the length of the original list is 155; the distorted list saved with to_csv has 100 datapoints.
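A quick way to check whether your pandas version chops the cell is a round trip with synthetic data (a sketch; the 155 fake IDs and paths are made up, and the dict construction is used to keep the list in a single cell — recent pandas versions write the full repr, so both lengths come out equal there):

```python
import ast
import pandas as pd

# 155 fake IDs standing in for the real list
ids = [str(n) for n in range(155)]
df = pd.DataFrame({'attribute_name': ['attribute1'],
                   'list_of_values': [ids]})
df.to_csv('/tmp/check.csv', index=False, header=False, sep='\t')

# Read it back and compare lengths; unequal lengths mean the cell was chopped
cell = pd.read_csv('/tmp/check.csv', sep='\t', header=None).iloc[0, 1]
restored = ast.literal_eval(cell)
print(len(ids), len(restored))
```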

Purpose of loc & flag: loc is the location of the file, and flag indicates whether I am writing the 1st row or the 2nd row onwards; I don't need to write the headers again if the 1st row has already been written.

Here's how I solved it. The main diagnosis was that I couldn't store the entire list I was passing, even when I treated it as a dict object; maybe it is because of the way pandas treats column lengths, but this is only a diagnosis. I got the whole list back by writing the file not with to_csv (pandas) but as a simple file, and then reading it back with pandas, in which case I could get the entire file back.

import pandas as pd
import ast

# Note that I changed my headers from the initial list format
headers="attribute_name\tlist_of_values"
loc='/home/ekta/abcd.csv'

data = ['attribute1',['1320698', '1320699', '1096323', '1320690', '1839190', '1091359', '1325750', '1569072', '1829679', '142100', '1320163', '1829673', '588914', '418137', '757085', '588910', '1321158', '1073897', '1823533', '1823535', '1091363', '1383908', '1834826', '36191', '1829641', '767536', '1829597', '1829591', '1326727', '1834700', '1317721', '1317802', '1834838', '52799', '1383915', '1320042', '1829654', '1829655', '1829658', '647089', '1829581', '1829586', '1829587', '1321116', '1829585', '1829588', '1839799', '1588509', '1834471', '1793632', '1327850', '1793599', '1456968', '1315869', '1793605', '1321236', '1829579', '1829577', '1793609', '1829571', '1829570', '1320139', '777057', '1829671', '1829566', '1831047', '1829567', '588927', '60484', '1793596', '1829634', '1839580', '1829569', '1793615', '1323529', '1793619', '1834758', '1612974', '1320007', '1839780', '1291475', '1834835', '1834453', '1823663', '418112', '1092106', '1829689', '1829688', '1793606', '647050', '1834742', '1839551', '1839553', '1834746', '1839556', '1834745', '1575978', '1834749', '1320711', '1317910', '1829700', '1839791', '1839796', '1320019', '1829494', '437131', '1829696', '1839576', '721318', '1829699', '1838874', '1315822', '647049', '1325775', '1320708', '133913', '835588', '1839564', '1320700', '1320707', '1839563', '1834737', '1834736', '1834734', '1823669', '1321159', '1320577', '1839768', '1823665', '1838602', '1823667', '1321099', '1753590', '1753593', '1320688', '1839583', '1326633', '1320681', '1793646', '1323683', '1091348', '982081', '1793648', '1478516', '1317650', '1829663', '1829667', '1829666', '1793640', '1839577', '1315855', '1317796', '1839775', '1321163', '1793642']]
flag=True
# write to a file
with open(loc, 'a') as f:
    if flag :
        f.write(headers+"\n")
        flag=False
    #Explicitly writing a tab separated file
    f.write(str(data[0])+"\t"+str(data[1])+"\n")

# read the file & confirm
df=pd.read_csv(loc,sep='\t',header='infer')
print df['list_of_values'].ix[0]
# the cell comes back as a string, so parse it back into a list before counting
print len(ast.literal_eval(df['list_of_values'].ix[0]))
#Yah !! 155

Thanks to @paul who diagnosed the problem & pointed me in this direction.
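If you'd rather avoid ast.literal_eval on read-back, a joined-string column is simpler to split apart again (a sketch with hypothetical data and file path, using the same plain-file write plus pandas read as above):

```python
import pandas as pd

# Hypothetical stand-in data, written as a plain tab-separated file
with open('/tmp/abcd.tsv', 'w') as f:
    f.write("attribute_name\tlist_of_values\n")
    f.write("attribute1\t" + ",".join(['1320698', '1320699', '1096323']) + "\n")

df = pd.read_csv('/tmp/abcd.tsv', sep='\t')
# the cell is a comma-joined string; split recovers the list
values = df['list_of_values'][0].split(',')
print(values)  # ['1320698', '1320699', '1096323']
```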

  • Are you sure that it's truncating the data? This should never happen, can you index into the rows that you say are being truncated and display them properly – EdChum Feb 27 '15 at 10:42
  • @EdChum Adding the sample data as an edit above, try creating a dataframe with that data & then saving it to csv . – ekta Feb 27 '15 at 10:44
  • Nope, I can visually see all values even though the data is a list – EdChum Feb 27 '15 at 10:47
  • I checked this: the length of the original list is 155, and the distorted list saved with to_csv has 100 datapoints. Try counting the len of both the original & this list? Also, how do I get rid of the ",..." appended at the end of the list while saving? I opened the file in a vi editor, so the issue is not in the "csv" reader so to say. – ekta Feb 27 '15 at 10:49
  • Sorry can you post code to recreate your df from your raw input data – EdChum Feb 27 '15 at 10:49
  • `flag` and `loc` are undefined in the snippet. Makes it hard to test without making up values. – Paul Feb 27 '15 at 10:58
  • @EdChum - Of course they are there in my long code. Will add them here so that you can run it quickly. loc is the filename, and flag indicates whether I'm writing the 1st row, or the 2nd row onwards. I also added what their purpose is. – ekta Feb 27 '15 at 10:59
  • Ignoring the flag and loc which are irrelevant your df goes to csv fine, but reloading it back in is problematic as it's parsing the data in a weird way so I get 1682 as a len as it's parsing each character rather than treating it as a list of strings – EdChum Feb 27 '15 at 11:00
  • 1
    I set `flag=false` and `loc='out.csv'` and confirm getting `...` in the output – Paul Feb 27 '15 at 11:01
  • @Paul Thank you for confirming - how about the solution now ? – ekta Feb 27 '15 at 11:02
  • Sorry to disappoint, this still works for me pandas 0.15.2, python 3.3.4 can you post your version – EdChum Feb 27 '15 at 11:03
  • @EdChum I updated the code, with loc & flag. Also I need to keep this as list, if possible. If I store this as string, then I will have to use ast.literal_eval(list_of_values.ix[index]) and so on. – ekta Feb 27 '15 at 11:04
  • `df` says it has 1 rows x 2 cols..... I assume that's not right, maybe it is an import problem. – Paul Feb 27 '15 at 11:04
  • @Paul yes. So does mine. Also, I tried using df[attribute1].astype('object') before saving it - that did not help either. – ekta Feb 27 '15 at 11:13
  • 1
    If I delete `.T` and construct the dataframe from a dict of array, I can get the full data back out. Do you want that as an answer? The [pandas docs](http://pandas.pydata.org/pandas-docs/dev/dsintro.html#dataframe) give various data structures that can be input to Dataframe constructor, including dict of array. I couldn't find the array format used in the question in the examples. – Paul Feb 27 '15 at 11:14
  • @paul - Why don't you post it as an answer - I will try if that works - and accept :) However make a note that I am passing "rows"/ data iteratively, so I will write to the dict. Also, since my data producer is HUGE, I create the dict in the function only, right - and then it drops itself ? – ekta Feb 27 '15 at 11:25
  • a quick note: Everything in pandas has to fit in ram+swap. I find 8 million rows x ~150 cols takes >16GB on one data set I have. – Paul Feb 27 '15 at 11:30
  • I am not an expert on python garbage collection, that sounds right. [You can append rows by appending dataframes together](http://stackoverflow.com/questions/24284342/insert-a-row-to-pandas-dataframe) – Paul Feb 27 '15 at 11:35
  • @Paul Did you delete your answer ? Also, I don't get the reason why you have a "push" - can't we use a series object , as in pandas documentation d = {'one' : Series([1., 2., 3.], index=['a', 'b', 'c']), ....: 'two' : Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])} reference here http://pandas.pydata.org/pandas-docs/dev/dsintro.html#dataframe – ekta Feb 27 '15 at 11:56
  • 1
    @ekta Yes, I deleted the answer after you started adding more and more to the question. If you need extensive help, you can ask multiple questions, within the limits allowed on SO, or hire a programmer. Mostly people answer questions to help people learn, not provide free labor. If you think I'm being unreasonable, perhaps ask others about policy on http://meta.stackoverflow.com – Paul Feb 27 '15 at 12:02
  • @Paul I am sorry that you felt that way. You were being responsive and I was communicating exactly as I would if a colleague of mine jumped in to help. I am sad that you did not see I was putting in my bit as well. In any case, though you diagnosed the problem, the answer that you posted did not help retain the entire list in column #2. I am up-voting your remarks on the identification of the problem and posting what finally worked as the solution. The main trick was not writing the file with to_csv, and I am yet to see a solution that works with to_csv. Thank you for your time. – ekta Feb 28 '15 at 09:42
  • @etka I'm glad you found your own solutions. Including pandas is including a lot of unnecessary code normally useful for filtering and analysis for a goal of simply writing a tab-separated line of values to a file one line at a time. After some additional thought, my answer post was definitely not a solution I would want to leave for future readers. More typical solutions would include something like `print "\t".join(array_of_numbers)` and using the [CSV module](https://docs.python.org/2/library/csv.html#module-contents). There are other tutorials/answers that cover these techniques. – Paul Mar 03 '15 at 06:52
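The csv-module route Paul mentions in the last comment can look like this (a sketch; the file path and the short stand-in list are made up):

```python
import csv

# Hypothetical stand-in for the real IDs
values = ['1320698', '1320699', '1096323']

with open('/tmp/csvmod.tsv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerow(['attribute_name', 'list_of_values'])
    # joining keeps the whole list in one cell as a plain string
    writer.writerow(['attribute1', ','.join(values)])

print(open('/tmp/csvmod.tsv').read())
```

The csv module never truncates a field, so this sidesteps the problem entirely while keeping a tab-separated file that pd.read_csv can load back.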
