Merging two CSV Files in Azure Data Factory by using custom .NET activity

Question

I have two CSV file which contains which contains many n-columns.I have to merge this two csv files with a single CSV file which having one unique column from the both input file.

I browsed thoroughly all the blogs and sites.All will result into using the custom .NET Activity.So i just go through this site

But still am not able to figure out which part in the C# Coding.Can any one share the code for how to merge this two CSV files using custom .NET Activity in Azure Data Factory.

This sounds like a database activity. Simply import the two files into Azure SQL Database tables using the Copy Activity (which natively supports import of .csv file to database) and join them there, eg with a Stored Proc activity; no custom coding required. — wBob, Jan 26 '17 at 12:17
I have an idea to use Hive activity be the simplest way to do a transformation within a single activity.How every the idea which you provided would take two activities ! — Jayendran, Jan 29 '17 at 11:41
You would spin up a whole HDInsight cluster to merge a couple of .csv files? You might want to think about the cost of different options, plus the ongoing maintainance overhead i.e. highly custom code is harder to maintain / debug / document rather than generic code using built-in tasks. The Azure Database option is a good one if you already have one available in your subscription. Thinking about it, U-SQL and its pay-as-you-go model might be a better fit. I will do you an example, can you please post some sample data and expected results? — wBob, Jan 29 '17 at 12:04
Yes i have azure database Subscription.But am not that much of familiar with U-SQL model.Since I'm new to azure.My CSV files are also big(i.e, upto 300MB).So i thought hive would be better This is what i want to do " i have 2 CSV files having nearly 20 columns each with it lets say file1.csv and file2.csv among those files i have one unique column (lets say it has "ZipCode").I want to merge this two files with that common zip code column and want the output file as output.csv. Please share your code idea too!.Because am not that much familiar with azure.! — Jayendran, Jan 29 '17 at 12:22
Can you provide some sample data and expected results please? Consider using gist. How do you want the query to behave when a record is in one file not the other etc — wBob, Jan 29 '17 at 13:29
Here the URL https://gist.github.com/jayendranarumugam/7017a3262f0e56ccb374f26e0166be78 Mostly there will be always record in both the files.For the worst case if there is no record in any one file.Simply shows an error message or log message that no records in that file name (i.e, Produce NULL value) — Jayendran, Jan 29 '17 at 14:06
Hi, you sample files are far too big. gist won't even display the second and third. Please post small files, say 10 rows each. That is all that is required. — wBob, Jan 29 '17 at 14:12
Hi, Sorry for the inconvenience here the Correct format for the sample input and output files https://gist.github.com/jayendranarumugam/19059d2ebcf3fa3f63c085da853a6b5c Ignore the previous one this will be the complete samples (2 inputs and 1 output files).Please let me know that you could able to view this.Thanks! — Jayendran, Jan 29 '17 at 14:38

score 1 · Accepted Answer · answered Jan 29 '17 at 20:18

Here is an example of how to join those two tab-separated files on Zip_Code column using U-SQL. This example assumes both files are held in Azure Data Lake Storage (ADLS). This script could easily be incorporated into a Data Factory pipeline:

// Get raw input from file A
@inputA =
    EXTRACT 
        Date_received   string,
        Product string,
        Sub_product string,
        Issue   string,
        Sub_issue   string,
        Consumer_complaint_narrative    string,
        Company_public_response string,
        Company string,
        State   string,
        ZIP_Code    string,
        Tags    string,
        Consumer_consent_provided   string,
        Submitted_via   string,
        Date_sent_to_company    string,
        Company_response_to_consumer    string,
        Timely_response string,
        Consumer_disputed   string,
        Complaint_ID    string

    FROM "/input/input48A.txt"
    USING Extractors.Tsv();


// Get raw input from file B
@inputB =
    EXTRACT Provider_ID string,
            Hospital_Name string,
            Address string,
            City string,
            State string,
            ZIP_Code string,
            County_Name string,
            Phone_Number string,
            Hospital_Type string,
            Hospital_Ownership string,
            Emergency_Services string,
            Meets_criteria_for_meaningful_use_of_EHRs string,
            Hospital_overall_rating string,
            Hospital_overall_rating_footnote string,
            Mortality_national_comparison string,
            Mortality_national_comparison_footnote string,
            Safety_of_care_national_comparison string,
            Safety_of_care_national_comparison_footnote string,
            Readmission_national_comparison string,
            Readmission_national_comparison_footnote string,
            Patient_experience_national_comparison string,
            Patient_experience_national_comparison_footnote string,
            Effectiveness_of_care_national_comparison string,
            Effectiveness_of_care_national_comparison_footnote string,
            Timeliness_of_care_national_comparison string,
            Timeliness_of_care_national_comparison_footnote string,
            Efficient_use_of_medical_imaging_national_comparison string,
            Efficient_use_of_medical_imaging_national_comparison_footnote string,
            Location string

    FROM "/input/input48B.txt"
    USING Extractors.Tsv();


// Join the two files on the Zip_Code column
@output =
    SELECT b.Provider_ID,
           b.Hospital_Name,
           b.Address,
           b.City,
           b.State,
           b.ZIP_Code,
           a.Complaint_ID

    FROM @inputA AS a
         INNER JOIN
             @inputB AS b
         ON a.ZIP_Code == b.ZIP_Code
    WHERE a.ZIP_Code == "36033";


// Output the file
OUTPUT @output
TO "/output/output.txt"
USING Outputters.Tsv(quoting : false);

This could also be converted into a U-SQL stored procedure with parameters for the filenames and Zip Code.

There are of course may ways to achieve this, each with their own pros and cons. The .net custom activity for example might feel more comfortable for someone with a .net background but you'll need some compute to run it on. Importing the files into an Azure SQL Database would be a good option for someone with a SQL / database background and an Azure SQL DB in the subscription.

Merging two CSV Files in Azure Data Factory by using custom .NET activity

1 Answers1