4

I'm quite new to Julia and i have a .csv File, which is stored inside a gzip, where i want to extract some informations from for educational purposes and to get to know the language better.

In Python there are many helpful functions from Panda to help with that, but i can't seem to get the Problem straight...

This is my Code (I KNOW, VERY WEAK!!!) :

{
import Pkg 
#Pkg.add("CSV")
#Pkg.add("DataFrames")
#Pkg.add("CSVFiles")
#Pkg.add("CodecZlib")
#Pkg.add("GZip")
using CSVFiles
using Pkg
using CSV
using DataFrames
using CodecZlib
using GZip


df = CSV.read("Path//to//file//file.csv.gzip", DataFrame)

print(df)
}

I added a Screen to show how the Columns inside the .csv File are looking like. enter image description here I would like to extract the Dates and make some sort of a Top 10 most commented users, Top 10 days with the most threads etc.

I would like to point out that this is not an Exercise given to me, but a training i would like to do 4 myself.

I know the Panda Version to this is looking like this:

df['threadcreateddate'] = pd.to_datetine(df['thread_created_utc']).dt.date

or

df['commentcreateddate'] = pd.to_datetime(df['comment_created_utc']).dt.date

And to sort it:

pf_number_of_threads = df.groupby('threadcreateddate')["thread_id'].nunique()

If i were to plot it:

df_number_of_threads.plot(kind='line')
plt.show()

To print:

head = df.head()
print(df_number_of_threads.sort_values(ascending=False).head(10))

Can someone help? The df.select() function didn't work for me.

Sundar R
  • 13,776
  • 6
  • 49
  • 76
Robin
  • 63
  • 5
  • 1
    "I added a Screen to show how the Columns inside the .csv File" You forgot to add the screenshot to the question it seems. Knowing what the data looks like would help, but in general you don't need all those packages, only `CSV` and `DataFrames` are needed for this. Have you seen the [Comparison page of DataFrames.jl and pandas](https://dataframes.juliadata.org/latest/man/comparisons/#Comparison-with-the-Python-package-pandas)? That can help you started with the code, and then you can ask specific questions when you hit roadblocks, with the Julia code you have so far. – Sundar R Aug 16 '22 at 12:13
  • Hello Sundar. Thank you for your reply and sorry, i just added the Screen. I'm still not sure how to "collect" the top 10 most commenting users or somethink like that, i keep getting errors. – Robin Aug 16 '22 at 14:19

2 Answers2

5

1. Packages

We obviously need DataFrames.jl. And since we're dealing with dates in the data, and doing a plot later, we'll include Dates and Plots as well.

As this example in CSV.jl's documentation shows, no additional packages are needed for gzipped data. CSV.jl can decompress automatically. So, you can remove the other using statements from your list.

julia> using CSV, DataFrames, Dates, Plots

2. Preparing the Data Frame

You can use CSV.read to load the data into the Data Frame, as in the question. Here, I'll use some sample (simplified) data for illustration, with just 4 columns:

julia> df
6×4 DataFrame
 Row │ thread_id  thread_created_utc   comment_id  comment_created_utc 
     │ Int64      String               Int64       String              
─────┼─────────────────────────────────────────────────────────────────
   1 │         1  2022-08-13T12:00:00           1  2022-08-13T12:00:00
   2 │         1  2022-08-13T12:00:00           2  2022-08-14T12:00:00
   3 │         1  2022-08-13T12:00:00           3  2022-08-15T12:00:00
   4 │         2  2022-08-16T12:00:00           4  2022-08-16T12:00:00
   5 │         2  2022-08-16T12:00:00           5  2022-08-17T12:00:00
   6 │         2  2022-08-16T12:00:00           6  2022-08-18T12:00:00

3. Converting from String to DateTime

To extract the thread dates from the string columns we have, we'll use the Dates standard libary.

Depending on the exact format your dates are in, you might have to add a datefmt argument for conversion to Dates data types (see the Constructors section of Dates in the Julia manual). Here in the sample data, the dates are in ISO standard format, so we don't need to specify the date format explicitly.

In Julia, we can get the date directly without intermediate conversion to a date-time type, but since it's a good idea to have the columns be in the proper type anyway, we'll first convert the existing columns from strings to DateTime:

julia> transform!(df, [:thread_created_utc, :comment_created_utc] .=> ByRow(DateTime), renamecols = false)
6×4 DataFrame
 Row │ thread_id  thread_created_utc   comment_id  comment_created_utc 
     │ Int64      DateTime             Int64       DateTime            
─────┼─────────────────────────────────────────────────────────────────
   1 │         1  2022-08-13T12:00:00           1  2022-08-13T12:00:00
   2 │         1  2022-08-13T12:00:00           2  2022-08-14T12:00:00
   3 │         1  2022-08-13T12:00:00           3  2022-08-15T12:00:00
   4 │         2  2022-08-16T12:00:00           4  2022-08-16T12:00:00
   5 │         2  2022-08-16T12:00:00           5  2022-08-17T12:00:00
   6 │         2  2022-08-16T12:00:00           6  2022-08-18T12:00:00

Though it looks similar, this data frame doesn't use Strings for the date-time columns, instead has proper DateTime type values. (For an explanation of how this transform! works, see the DataFrames manual: Selecting and transforming columns.)

Edit: Based on the screenshot added to the question now, in your case you'd use transform!(df, [:thread_created_utc, :comment_created_utc] .=> ByRow(s -> DateTime(s, dateformat"yyyy-mm-dd HH:MM:SS.s")), renamecols = false).

4. Creating Date columns

Now, creating the date columns is as easy as:

julia> df.threadcreateddate = Date.(df.thread_created_utc);

julia> df.commentcreateddate = Date.(df.comment_created_utc);

julia> df
6×6 DataFrame
 Row │ thread_id  thread_created_utc   comment_id  comment_created_utc  commentcreateddate  threadcreatedate 
     │ Int64      DateTime             Int64       DateTime             Date                Date             
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │         1  2022-08-13T12:00:00           1  2022-08-13T12:00:00  2022-08-13          2022-08-13
   2 │         1  2022-08-13T12:00:00           2  2022-08-14T12:00:00  2022-08-14          2022-08-13
   3 │         1  2022-08-13T12:00:00           3  2022-08-15T12:00:00  2022-08-15          2022-08-13
   4 │         2  2022-08-16T12:00:00           4  2022-08-16T12:00:00  2022-08-16          2022-08-16
   5 │         2  2022-08-16T12:00:00           5  2022-08-17T12:00:00  2022-08-17          2022-08-16
   6 │         2  2022-08-16T12:00:00           6  2022-08-18T12:00:00  2022-08-18          2022-08-16

These could also be written as a transform! call, and in fact the transform! call in the previous code segment could have instead been replaced with df.thread_created_utc = DateTime.(df.thread_created_utc) and df.comment_created_utc = DateTime.(df.comment_created_utc). However, transform offers a very powerful and flexible syntax that can do a lot more, so it's useful to familiarize yourself with it if you're going to work on DataFrames.

5. Getting the number of threads per day

julia> gdf = combine(groupby(df, :threadcreateddate), :thread_id => length ∘ unique => :number_of_threads)
2×2 DataFrame
 Row │ threadcreateddate  number_of_threads 
     │ Date               Int64             
─────┼──────────────────────────────────────
   1 │ 2022-08-13                         1
   2 │ 2022-08-16                         1

Note that df.groupby('threadcreateddate') becomes groupby(df, :threadcreateddate), which is a common pattern in Python-to-Julia conversions. Julia doesn't use the . based object-oriented syntax, and instead the data frame is one of the arguments to the function.

length ∘ unique uses the function composition operator , and the result is a function that applies unique and then length. Here we take the unique values of thread_id column in each group, apply length to them (so, the equivalent of nunique), and store the result in number_of_threads column in a new GroupedDataFrame called gdf.

6. Plotting

julia> plot(gdf.threadcreateddate, gdf.number_of_threads)

Since our grouped data frame conveniently contains both the date and the number of threads, we can plot the number_of_threads against the dates, making for a nice and informative visualization.

Sundar R
  • 13,776
  • 6
  • 49
  • 76
  • Thank you very much for your answer Sundar, it is very helpful. I added a Screen from the DataFrame for clarification. I get the "ArgumentError: Invalid DateTime string" Error. I tried your String-to-DateTime converting command like "transform!(df, [:thread_created_utc, :comment_created_utc] .=> ByRow(DateTime("yyyy-mm-dd"), renamecols = false)" this, but it didn't help. Did I mess up the Format? Or how do i need to modify the command for that specific Format? – Robin Aug 17 '22 at 07:48
  • The format mirros your date string, with letters replacing the numbers - for eg., the 4 digit year is written as `yyyy` in the format, the space between date and time is written as a space in the dateformat too, etc. At the end you have milliseconds (`.000000`), and there alone you should have only one `s` representing any number of millisecond digits. I've updated the answer to show the command used in your case (at the end of step 3). The `s -> ` syntax I've used defines an anonymous function i.e. a lambda in Python terms. – Sundar R Aug 17 '22 at 08:23
1

As Sundar R commented it is hard to give you a precise answer for your data as there might be some relevant details. But here is a general pattern you can follow:

julia> using DataFrames

julia> df = DataFrame(id = [1, 1, 2, 2, 2, 3])
6×1 DataFrame
 Row │ id
     │ Int64
─────┼───────
   1 │     1
   2 │     1
   3 │     2
   4 │     2
   5 │     2
   6 │     3

julia> first(sort(combine(groupby(df, :id), nrow), :nrow, rev=true), 10)
3×2 DataFrame
 Row │ id     nrow
     │ Int64  Int64
─────┼──────────────
   1 │     2      3
   2 │     1      2
   3 │     3      1

What this code does:

  • groupby groups data by the column you want to aggregate
  • combine with nrow argument counts the number of rows in each group and stores it in :nrow column (this is the default, you could choose other column name)
  • sort sorts data frame by :nrow and rev=true makes the order descending
  • first picks 10 first rows from this data frame

If you want something more similar to dplyr in R with piping you can use @chain that is exported by DataFramesMeta.jl:

julia> using DataFramesMeta

julia> @chain df begin
           groupby(:id)
           combine(nrow)
           sort(:nrow, rev=true)
           first(10)
       end
3×2 DataFrame
 Row │ id     nrow
     │ Int64  Int64
─────┼──────────────
   1 │     2      3
   2 │     1      2
   3 │     3      1
Bogumił Kamiński
  • 66,844
  • 3
  • 80
  • 107
  • I Would like to ask 1 more Question: I tried to plot these and other functions, in the CMD (REPL) they show just fine, but if i were to plot them in VSC, they won't show..... I tried yours and another way to show the Plots => collection_threads.plot(x = :threadcreateddate, y = :number_of_threads, title = "Number of threads per day") collection_comments.plot(x = :commentcreateddate, y = :number_of_comments, title = "Number of comments per day"), but here i get the LoadError: ArgumentError: column name :plot not found in the data frame. With your code the programm runs and executes w/o plot – Robin Aug 18 '22 at 16:12
  • With your plot-calls i get the "Possible method call error.Julia(IncorrectCallArgs)" Warning in VSC, but i can run/execute it....... – Robin Aug 18 '22 at 16:16
  • In VSCode you need Julia extension installed and then things should work, see https://www.julia-vscode.org/docs/latest/userguide/plotgallery/. – Bogumił Kamiński Aug 18 '22 at 19:58
  • I have the Julia Extension installed. The first Problem occured when i tried to "add" the Plot function, i had the vsc (process:544): GLib-GIO-WARNING **: 09:39:19.482: Unexpectedly, UWP app `3138AweZip.AweZip_1.4.16.0_x86__ffd303wmbhcjt' (AUMId `3138AweZip.AweZip_ffd303wmbhcjt!App') supports 167 extensions but has no verbs Error, but uninstalling BreeZip fixed that. Now the Plots just won't show. I enabled the Plot Pane in the Setting of Julia, but still no Plot.... In VSC i get the Possible Method call error.Julia Warning. – Robin Aug 19 '22 at 08:47
  • I recommend that you report this problem on Julia Discourse https://discourse.julialang.org/ as you might get some feedback from the maintainers of this extension. – Bogumił Kamiński Aug 19 '22 at 09:36