I am trying to merge two data frames: one contains variables such as Date, Author, Paper and IDs; the other contains texts and their IDs. In case it matters, the data frame containing the texts was obtained by converting a VCorpus into a data frame with the following code:

factivadf <- data.frame(text = unlist(sapply(corpus, `[`, "content")), stringsAsFactors = FALSE)

To merge them, I use the following code:

factivaclean <- full_join(corpusVars, factiva, by = "doc_id")

And I get the following error:

Error in UseMethod("tbl_vars") : no applicable method for 'tbl_vars' applied to an object of class "list"

My two original data frames were regular data frames. At first I thought the error meant they had to be converted with tibble(), so I applied that function to both, but I keep getting the same error.

Here is the dput of the head of my first data frame, corpusVars:

structure(list(corpusVars = structure(list(doc_id = c("LEPARI0020120304e833000v5", 
"HUMAN00020120301e8320001e", "LACRX00020120228e82s00017", "HUMAN00020120223e82o0001h", 
"HUMAN00020120223e82o0001g", "HUMAN00020120223e82o0000n"), Origine = c("Le Parisien-Aujourd'hui en France", 
"L'Humanité", "La Croix", "L'Humanité", "L'Humanité", "L'Humanité"
), Date = structure(c(15402, 15401, 15398, 15394, 15394, 15394
), class = "Date"), Auteur = c(NA, NA, NA, "Entretien réalisé par <U+2028>Fara C", 
"V. H.", NA)), .internal.selfref = <pointer: 0x0000024403b11ef0>, row.names = c(NA, 
6L), class = c("data.table", "data.frame"))), row.names = c(NA, 
-6L), class = c("tbl_df", "tbl", "data.frame"))

Here is the dput of the second data frame, factivadf:

structure(list(factivadf = structure(list(doc_id = c("ECHOS00020110523e75n0004j.content", 
"ECHOS00020110525e75p0000o.content1", "ECHOS00020110525e75p0000o.content2", 
"ECHOS00020110525e75p0000o.content3", "ECHOS00020110525e75p0000o.content4", 
"ECHOS00020110530e75u00019.content1"), text = c("Environ 500 personnes s'étaient donné rendez-vous hier devant le Centre Georges-Pompidou pour condamner le « sexisme » exprimé par de nombreux responsables politiques et médiatiques autour de l'affaire DSK. Une initiative portée notamment par les associations Paroles de femmes et Osez le féminisme, qui ont rappelé que 75.000 femmes sont chaque année en France victimes de viol.", 
"Le propos. La collection « Les 50 grandes idées que vous devez connaître » s'enrichit d'un nouveau titre, signé par un enseignant britannique de littérature qui se consacre maintenant à la vulgarisation des savoirs. Comme dans un dictionnaire de science politique (mais sans pontifier, ni tourner des pages autour du pot), 50 entrées promènent le lecteur de la théorie politique (liberté, égalité, tyrannie, utopie, etc.) aux matières de la politique (pauvreté, sécurité, racisme, corruption, etc.) en passant par les idéologies (anarchisme, capitalisme, socialisme, multiculturalisme, féminisme, etc.). Un petit glossaire complète le tout, avec de rapides définitions, qui auraient pu aussi constituer des idées à développer (laisser-faire, lobbying, réforme, etc.).", 
"Conçu comme un outil agréable de travail, ce livre original dispose d'un index permettant de retrouver nombre de personnages (Nicolas Sarkozy, Aristote, Aristide Briand ou bien encore la reine Victoria) à travers des pages dédiées aussi à la différence, la tyrannie, la laïcité ou le droit divin.", 
"L'intérêt. Mêlant citations et proverbes (plus ou moins célèbres), encadrés descriptifs, chronologies thématiques, tout en alternant ton sérieux (le plus souvent) et piques ironiques (voir la notice sur le politiquement correct), l'ouvrage permet, en se feuilletant, de passer un bon moment. Il a, au-delà, toute sa place dans une bibliothèque, à portée de main, pour une présentation rapide et claire de thèmes tout à fait sérieux.", 
"La citation.« La politique est supposée être la seconde plus vieille profession. J'ai fini par réaliser qu'elle ressemblait beaucoup à la première. » (Ronald Reagan)", 
"Le décret que prépare le gouvernement pour favoriser l'égalité salariale entre les hommes et les femmes est violemment critiqué par les syndicats, ce qui est courant, mais aussi par une partie de la majorité, ce qui l'est moins. Députée UMP et présidente de la délégation aux droits des femmes de l'Assemblée nationale, Marie-Jo Zimmermann ne mâche pas ses mots. « Ce décret, c'est de l'eau tiède, il ne réglera rien au problème», assure-t-elle."
)), .internal.selfref = <pointer: 0x0000024403b11ef0>, row.names = c(NA, 
6L), class = "data.frame")), row.names = c(NA, -6L), class = c("tbl_df", 
"tbl", "data.frame"))

Do you know how to merge these data frames without getting this error?

Thank you in advance!

EDIT:

When I open the file with read.table("corpusVars.csv", header = TRUE, sep = ";", na.strings = " "), I get the following error (the same happens with the other file, only at a different line):

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
  line 102 did not have 4 elements

When I open it with read.csv2, here is a subset of the dput of the head of corpusVars:

structure(list(doc_id = structure(c(898L, 434L, 702L, 433L, 432L, 
431L), .Label = c("ECHOS00020110523e75n0004j", "ECHOS00020110525e75p0000o", 
"ECHOS00020110530e75u00019", "ECHOS00020110603e76300003", "ECHOS00020110615e76f0003l", 
"ECHOS00020110621e76l00021"), class = "factor"), 
    Origine = structure(c(5L, 1L, 2L, 1L, 1L, 1L), .Label = c("L'Humanité", 
    "La Croix", "La Tribune", "Le Figaro", "Le Parisien-Aujourd'hui en France", 
    "Les Echos"), class = "factor"), Date = structure(c(30L, 
    16L, 368L, 313L, 313L, 313L), .Label = c("01/02/2012", "01/02/2019", 
    "01/03/2019", "01/04/2019", "01/06/2011", "01/07/2011"), class = "factor"), 
    Auteur = structure(c(NA, NA, NA, 150L, 463L, NA), .Label = c("A.DA.", 
    "A.F.", "Adam Arroudj; 0", "Adèle Smith; adelesmith100@gmail.com", 
    "ADRIEN GOMBEAUD", "Adrien Jaulmes; ajaulmes@lefigaro.fr"), class = "factor")), row.names = c(NA, 
6L), class = "data.frame")
  • Have you tried `merge`? – Chris Ruehlemann Mar 06 '20 at 09:09
  • I guess this `.internal.selfref = <pointer: 0x0000024403b11ef0>,` should not be included in the data? – Chris Ruehlemann Mar 06 '20 at 09:13
  • Yes, but I don't know why it exists and how to delete it – Maître Cheminade Mar 06 '20 at 10:35
  • @ChrisRuehlemann yes, and I get a table with 5 obs. of 0 variables, "No data available in table". I guess it is because of the `.internal.selfref = <pointer: 0x0000024403b11ef0>`, but I don't know why this exists – Maître Cheminade Mar 06 '20 at 10:39
  • Opening the .csv file with `read_csv2` instead of `fread` seems to make it disappear, but the result after merging is still the same – Maître Cheminade Mar 06 '20 at 10:45
  • What does the data look like when you read it in using `read_csv2`? And is there a reason why you do not read in the data using, for example, `read.table`? – Chris Ruehlemann Mar 06 '20 at 10:58
  • (I answered by editing the question because of character limit) – Maître Cheminade Mar 06 '20 at 11:19
  • When you use `read.table` you should set `sep = "\t"`; also, I'm not sure whether `na.strings = " "` is correct, as this only converts to NA those cells that contain exactly one whitespace character; it does not convert *empty* cells. To set these to NA, use `na.strings = ""` (without a space between the quote marks!). – Chris Ruehlemann Mar 06 '20 at 11:30
  • The separator is ";", so `sep = "\t"` doesn't identify the columns (it just puts everything in a single one). Opening the files with `read.table` gives a blank table, "no data available in table", after merging with `merge()`. – Maître Cheminade Mar 06 '20 at 12:01
  • Well, `sep = "\t"` normally *does* identify columns... – Chris Ruehlemann Mar 06 '20 at 12:05
  • Not in this case, probably because it was originally written with write.csv2, which I prefer because of the " ; " separator – Maître Cheminade Mar 06 '20 at 12:07
  • Okay, not knowing what data you have, what I'd do if I were you is copy and paste the whole .csv file into a text editor, save it as .txt and read it in using `read.table`; you may also have to experiment with the arguments to `read.table`, see `?read.table` – Chris Ruehlemann Mar 06 '20 at 12:10
  • I added a short answer which is mainly a patch. Did you try to open the csv file with notepad? I use `write.csv2` a lot myself and it never did this to me. – Dan Chaltiel Mar 06 '20 at 13:03
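
The reading advice in the comments above can be sketched as follows. This is only an illustration under stated assumptions: the file contents are made up, and the file is assumed to be semicolon-separated, as the asker says. One thing that may be worth checking is that `read.table` treats both `"` and `'` as quote characters by default, so an apostrophe in a value such as "Le Parisien-Aujourd'hui en France" can lead to a "did not have 4 elements" error; whether that is the cause here is a guess.

```r
# Write a small semicolon-separated file to a temporary path.
# The rows are hypothetical and only mimic the columns in the question.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "doc_id;Origine;Date;Auteur",
  "DOC1;Le Parisien-Aujourd'hui en France;03/03/2012;",
  "DOC2;La Croix;28/02/2012;V. H."
), path)

# count.fields() reports how many fields each line has, which helps
# locate the lines that scan() complains about.
n_fields <- count.fields(path, sep = ";", quote = "\"")

vars <- read.table(
  path,
  header = TRUE,
  sep = ";",
  quote = "\"",              # do not treat apostrophes as quote characters
  na.strings = "",           # empty cells become NA
  stringsAsFactors = FALSE   # keep IDs as character, not factor
)

# Dates in the question's file look like day/month/year.
vars$Date <- as.Date(vars$Date, format = "%d/%m/%Y")
```

Keeping `doc_id` as character rather than factor also avoids joining on factor columns with different level sets.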

1 Answer

There seems to be a problem with your reading function.

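The output is not a plain data frame, but a data frame (tibble) whose single column is itself a data frame, as the nested structure(list(corpusVars = structure(list(...)))) in your dput shows.

Indeed, extracting the inner data frames with $ seems to work and gives a properly merged data frame: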

full_join(corpusVars$corpusVars, factivadf$factivadf, by = "doc_id")

Of note, as Chris said, .internal.selfref = <pointer: 0x0000024403b11ef0> should not be included, and I had to remove it from your dput output for the example to work. It does seem to be related to fread; see the question "Warning: 'Invalid .internal.selfref detected' when adding a column to a data.table returned from a function".
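
A minimal sketch of the unwrapping described above, using toy stand-ins for the two objects (all values are made up, and `unwrap_df` is a hypothetical helper name, not a dplyr function). It also strips the `.content`, `.content1`, ... suffixes visible in the `doc_id` column of `factivadf`, on the assumption that `unlist()` appended them and that they are not part of the real IDs; without that step the keys would not match.

```r
library(dplyr)

# Toy stand-ins for the question's objects (hypothetical values):
# each is a tibble whose only column is itself a data frame.
corpusVars <- tibble(corpusVars = data.frame(
  doc_id = c("DOC1", "DOC2"),
  Origine = c("La Croix", "Les Echos"),
  stringsAsFactors = FALSE
))
factivadf <- tibble(factivadf = data.frame(
  doc_id = c("DOC1.content", "DOC2.content1", "DOC2.content2"),
  text = c("first text", "second text, part 1", "second text, part 2"),
  stringsAsFactors = FALSE
))

# Hypothetical helper: if a data frame consists of a single data-frame
# column, return that inner data frame; otherwise return the input as is.
unwrap_df <- function(x) {
  if (ncol(x) == 1 && is.data.frame(x[[1]])) x[[1]] else x
}

vars  <- unwrap_df(corpusVars)
texts <- unwrap_df(factivadf)

# Assumption: the ".content<n>" suffix is an artifact of unlist() and
# removing it recovers the document ID used in corpusVars.
texts$doc_id <- sub("\\.content[0-9]*$", "", texts$doc_id)

# One row per text fragment, with the document-level variables attached.
merged <- full_join(vars, texts, by = "doc_id")
```

If each document should end up as a single row, the fragments would first need to be pasted together per `doc_id`; that step is left out here.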

Dan Chaltiel