2

My script prints, for each term, the year and the count of that word in articles from that particular year:

abcd
2013
118
2014
23
xyz
2013
1
2014
45

I want each year added as a new column to my existing DataFrame, which contains only the terms.

Expected output:

Terms 2013  2014  2015 
abc   118   76    90
xyz   23    0     36

The input for my script was a CSV file:

Terms
xyz
abc
efg

The script I wrote is:

import urllib.request
import xml.etree.ElementTree as ET

import pandas as pd

df = pd.read_csv('a.csv', header=None)

u = "http: term=%s&mindate=%d/01/01&maxdate=%d/12/31"
startYear = 2013
endYear = 2018

for row in df.itertuples():
    term = str(row[1])
    print(term)

    # query every year for the current term
    for year in range(startYear, endYear + 1):
        url = u % (term.replace(" ", "+"), year, year)
        page = urllib.request.urlopen(url).read()
        doc = ET.XML(page)
        count = doc.find("Count").text
        print(year)
        print(count)

The output of df.head() is:

                         0
0           1,2,3-triazole
1  16s rrna gene amplicons

Any help will be greatly appreciated, thanks in advance !!

DJK
K.S
  • `the output of my script`: Is this a `list`, output from `print`, or something else? We need to know what you are starting with to help you reach your destination. – jpp Jun 21 '18 at 10:15
  • Sorry, for not being clear. It is the `list` output from `print` – K.S Jun 21 '18 at 10:20
  • Nope still not clear. What does `list` output from `print` mean? Think of it this way, what can we copy-paste into our code to replicate the object containing all those items? – jpp Jun 21 '18 at 10:25
  • I have edited my question with the script and input format. – K.S Jun 21 '18 at 10:33
  • Can you show us `df.head()` instead? – jpp Jun 21 '18 at 11:39
  • Update your question, please. No code in comments. No images / links either. – jpp Jun 21 '18 at 11:46
  • @jpp edited the question for `df.head` – K.S Jun 21 '18 at 11:48
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/173558/discussion-between-k-s-and-jpp). – K.S Jun 21 '18 at 12:19

2 Answers

1

I would read the CSV into an array with numpy, reshape it (also with numpy), and then convert the resulting matrix/2D array to a DataFrame.
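
A minimal sketch of that idea, under two assumptions the question does not confirm: the printed output has already been collected into a flat list, and every term reports the same years in the same order. The names `tokens` and `n_years` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Flat output copied from the question: a term, then (year, count) pairs.
tokens = ["abcd", "2013", "118", "2014", "23",
          "xyz", "2013", "1", "2014", "45"]

n_years = 2  # assumption: same number of years for every term

# One row per term: [term, year, count, year, count, ...]
rows = np.array(tokens).reshape(-1, 1 + 2 * n_years)

years = rows[0, 1::2].tolist()       # year labels, taken from the first row
counts = rows[:, 2::2].astype(int)   # every second column after the years

df = pd.DataFrame(counts, columns=years)
df.insert(0, "Terms", rows[:, 0])
print(df)
```

If some terms are missing years, the reshape no longer lines up, and a record-by-record approach is the safer choice.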

Petronella
  • Not very familiar with numpy, can you help me with this – K.S Jun 21 '18 at 12:13
  • to read the file : https://stackoverflow.com/questions/3518778/how-to-read-csv-into-record-array-in-numpy and to reshape, sorry, not resize: https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.reshape.html and from array to DataFrame: https://pandas.pydata.org/pandas-docs/version/0.18/generated/pandas.DataFrame.html – Petronella Jun 21 '18 at 12:15
  • This is more of a comment. If you could add some sample code to solve OP's problem, that would be ideal – DJK Jun 21 '18 at 12:56
  • DJK, I do not know what the CSV looks like, therefore I cannot actually give a code solution. My comment, though, links to many code examples for each step. I did not find it fair to copy from other sources, so I just gave the sources. – Petronella Jun 21 '18 at 14:58
0

Something like this should do it:

#!/usr/bin/env python 

def mkdf(filename):
    def combine(term, l):
        d = {"term": term}
        d.update(dict(zip(l[::2], l[1::2])))
        return d

    term = None
    other = []
    with open(filename) as I:
        n = 0
        for line in I:
            line = line.strip()
            try:
                int(line)
            except ValueError:
                # not an int, so this line starts a new term
                if term:    # if we have one, create the record
                    yield combine(term, other)

                term = line
                other = []
                n = 0
            else:
                if n > 0:
                    other.append(line)
            n += 1

        # and the last one (skipped if the file held no term at all)
        if term:
            yield combine(term, other)

if __name__ == "__main__":
    import pandas as pd
    import sys

    df = pd.DataFrame([r for r in mkdf(sys.argv[1])])
    print(df)

Usage: python scriptname.py /tmp/IN (or any other file containing your data)

Output:

  2013 2014  term
0  118   23  abcd
1    1   45   xyz
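
To get from this frame to the layout asked for in the question (a Terms column first, one column per year, 0 where a term has no count), one possible follow-up is a left merge onto the existing frame of terms. This is only a sketch: `records` stands in for the dicts yielded by `mkdf()`, and the `terms` frame with its `efg` row is a hypothetical example of the existing data.

```python
import pandas as pd

# Records shaped like the dicts yielded by mkdf() (sample values).
records = [
    {"term": "abcd", "2013": "118", "2014": "23"},
    {"term": "xyz", "2013": "1", "2014": "45"},
]
counts = pd.DataFrame(records).rename(columns={"term": "Terms"})

# The counts were read as strings, so convert the year columns to int.
year_cols = [c for c in counts.columns if c != "Terms"]
counts[year_cols] = counts[year_cols].astype(int)

# Hypothetical existing frame that holds only the terms.
terms = pd.DataFrame({"Terms": ["xyz", "abcd", "efg"]})

# A left merge keeps every term; terms without counts get 0.
result = terms.merge(counts, on="Terms", how="left")
result[year_cols] = result[year_cols].fillna(0).astype(int)
print(result)
```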
hootnot