00:11
Today, we wanted to talk about another one: it's that time of the month where we do our science podcast collaboration project.
This month it's hosted by 佐々木亮さん (Ryo Sasaki), who runs the podcasts 宇宙話 and 隣のデータ分析屋さん.
The theme this month is data, which is pretty broad, honestly. Pretty loosely defined.
That just means we can cook it in many, many different ways. So I'm pretty excited to hear what other people have to say about data in general.
I mean, for us as researchers, we have a love-hate relationship with data, right? We need it, we want it, but sometimes it gives us the hardest time.
It can be our biggest headache, or it can be the most exciting thing to get. I don't know.
And it's also very different for everyone, obviously very different between the social sciences and the natural sciences.
It also really depends on the kind of experiment you're doing and where the data is coming from.
But here, I guess, let's just focus on data that is a primary source, right?
Not secondary sources, not metadata; let's just focus on the data we can collect ourselves.
Yeah, so any thoughts on that?
So in most cases, I collect human behavior data and also some physiological measurements from human subjects.
And for the physiological measurements, there aren't such large inter-individual differences, because they're physiological.
The quality of the data itself depends more on how we collect it.
But from what I hear from social science people, they're interested in human communication, styles, or interactions.
Yeah, like interactions. Yeah, exactly.
In that case, there are huge, huge inter-individual differences.
So what I hear is that they would need on the order of 100 people to see something.
03:04
Right, I mean, depending on what they're looking at, you might need population-wide data.
And you might need thousands, or maybe even a thousand isn't a big enough N, to say anything statistically reliable about that specific behavior or mechanism of interaction.
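As a rough illustration of why large person-to-person variability pushes N into the hundreds or thousands, here is a minimal sketch using the standard normal-approximation sample-size formula; the effect sizes and standard deviation are made-up numbers for illustration, not anything from the episode.

```python
# Rough, illustrative sample-size estimate for comparing two groups
# (normal approximation to a two-sample t-test). Effect sizes and the
# standard deviation below are hypothetical, not from the episode.
from math import ceil
from scipy.stats import norm

def n_per_group(effect, sd, alpha=0.05, power=0.80):
    """Approximate participants per group needed to detect a mean
    difference `effect` given within-group standard deviation `sd`."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return ceil(2 * (sd * (z_alpha + z_beta) / effect) ** 2)

# The noisier people are relative to the effect, the more of them you need.
print(n_per_group(effect=0.5, sd=1.0))  # moderate effect -> ~63 per group
print(n_per_group(effect=0.2, sd=1.0))  # small effect    -> ~393 per group
```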
So what I'm curious about is, how is it in your field? Because the data you collect is so different. How much data do you need to collect? What are the sources of noise?
And when you find outlier values, how do you treat them so you can statistically investigate something in your field?
So, specifically for my PhD research, we rely on the fact that quantum mechanics is deterministic: it doesn't matter if you shoot the cyclohexadiene now or tomorrow, it's going to do the same thing, as long as you trigger the reaction the same way.
So, in that sense, the expected margin of error is probably pretty small compared to the data you're getting.
My naive question is, in my field it's always indirect; any measurement is indirect.
Yeah.
So we don't know what the correct answers are. But there are some measures that show the quality of the data itself. So in your field, how would you measure quality? How do you know it's the true measurement?
Oh, so in terms of verifying the validity of the data: unlike biological experiments, we do need to be able to reproduce it, of course, to a certain extent, but we don't need to repeat the same experiment, I don't know, ten times over just to prove a point. If you get the data, that's the correct answer.
Sort of?
Yeah, given that your signal is well above your noise. The noise sources in my PhD research experiment largely come from the fact that two laser beams need to overlap in space and time, right? And lasers have inherent instabilities; they're affected by the temperature, the humidity of the lab,
06:17
small things like that. Our laser table is floated, meaning the entire optics setup sits on a table with nitrogen flowing through it to damp any vibration from the building, so people walking around the laser table don't affect our experiment.
But if there's construction going on while we're running an experiment, that might affect it.
And then you would notice, because that noise is huge?
Again, it depends. If you know exactly when the construction was happening during your, let's say, three-day scan, and the unusual noise profile matches the time they were doing the jackhammering, then maybe you can guess that. But most of the time, our noise sources are things like the laser itself and the overlap of the two lasers, so that's two unstable things coming together, which adds extra noise. And then there's shot noise: every time a pulse gets ejected and hits something, there's shot noise. But those are statistical errors, so if you scan over and over again, you can average them down. And how noisy the data is, again, depends on how good your experimental conditions are and how easy your molecule is to work with.
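A toy sketch of that averaging point, with simulated noise rather than real beamline data: random noise on a repeated scan shrinks roughly as one over the square root of the number of scans averaged. The trace and noise level here are invented.

```python
# Toy illustration: random (shot-like) noise averages down as ~1/sqrt(N).
# The "true" trace is a made-up exponential decay, not real data.
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(-500, 501, 10)                            # pump-probe delays, fs
true_trace = np.where(t < 0, 0.0, np.exp(-t / 200.0))   # hypothetical signal

def residual_noise(n_scans, noise_sigma=0.5):
    """Average n_scans noisy copies and return the leftover noise level."""
    scans = true_trace + rng.normal(0, noise_sigma, size=(n_scans, t.size))
    avg = scans.mean(axis=0)
    return np.std(avg - true_trace)

for n in (1, 4, 16, 64):
    print(n, round(residual_noise(n), 3))   # noise drops ~2x per 4x scans
```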
Because we trigger the molecule with lasers, it's a photochemical reaction, a lot of it depends on the cross section, that is, the probability that the molecule will absorb the light you're shooting at it. If the molecule has a high cross section, that means every time it gets hit, there's a higher chance it actually absorbs the energy from the photon and does the reaction. If it has a low cross section, then even if it gets hit, it might not absorb the energy the way you want it to. So if your particular molecule has a high absorption cross section, it's fairly straightforward: you just do the scan a bunch of times. For some molecules, let's say, a 12-hour scan is sufficient to get a nice clean trace, but other molecules need two or three days to reach a similar signal-to-noise ratio. It also depends on the exact conditions. Typically, if you're running the same molecule at a different temperature, a colder temperature means the molecular beam is less dense.
09:24
And therefore the noise is higher. It also depends on the laser intensity.
For the same reason: whether the absorption cross section is high or low. And also, sometimes too much energy is not good: a pulse that's too intense is bad for my experiment, because we want to look at the reaction when one first pulse comes in and then a second pulse of a different color comes in. It should be a one-plus-one kind of experiment. But if your first pulse is very, very intense, sometimes it unintentionally becomes one-plus-one-plus-one, and you see a different signal. Your actual signal gets diluted, and you need to do some subtraction to get rid of it, which we typically don't want to do. So it really depends.
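To put rough numbers on that cross-section and pulse-intensity trade-off, here is a back-of-the-envelope sketch; the cross sections and photon fluence are invented for illustration and are not values from this experiment.

```python
# Toy numbers only: one-photon excitation probability for a given
# absorption cross section (cm^2) and photon fluence (photons/cm^2).
import math

def p_absorb(sigma_cm2, fluence):
    """One-photon absorption probability per molecule: 1 - exp(-sigma * F)."""
    return 1 - math.exp(-sigma_cm2 * fluence)

fluence = 1e15                      # photons per cm^2 per pulse (hypothetical)
print(p_absorb(1e-17, fluence))     # high cross section -> ~0.01 per shot
print(p_absorb(1e-19, fluence))     # low cross section  -> ~1e-4 per shot

# Multiphoton contamination grows faster with intensity: a one-photon yield
# scales roughly as I, while a two-photon (1+1+1-type) contribution scales
# as I^2, so cranking up the pump boosts the unwanted channel disproportionately.
```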
But the nice thing about our experiment is that as long as the instrument is stable, which is not a given, by the way, it's not a given, but if it is stable, then we can run it overnight, over many, many days, and keep collecting statistics, because most of this noise is statistical: if you do more, you average more, and you have nicer data. Yeah, you have nicer data. Even then, of course, sometimes your signal is just so small, and that might have to do with your experimental conditions, the limits of your instrument, or wrong data collection. And at some point it's going to plateau, right? The quality of the data is going to plateau; whether you run for three days or ten days isn't going to make more than a marginal difference. Typically, I don't keep running the experiment until I literally see a plateau; I just run it until I think it's good enough. Like, okay, based on my experience, this looks like clean data. Because after taking the raw data, you will massage it; you will need to do some signal processing.
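A minimal sketch of what that kind of post-processing could look like on an averaged delay trace; the steps here (background subtraction, normalization, light smoothing) are generic examples, not the specific pipeline used in this work, and the trace is fake.

```python
# Generic post-processing of an averaged pump-probe trace (illustrative only):
# subtract the pre-time-zero background, normalize, and lightly smooth.
import numpy as np

def massage(delays_fs, avg_trace, smooth_pts=5):
    baseline = avg_trace[delays_fs < 0].mean()     # pre-pump background level
    trace = avg_trace - baseline
    trace /= np.abs(trace).max()                   # normalize to the peak
    kernel = np.ones(smooth_pts) / smooth_pts      # simple moving average
    return np.convolve(trace, kernel, mode="same")

delays = np.arange(-500, 501, 10)                  # fs
raw_avg = np.where(delays < 0, 0.0, np.exp(-delays / 200.0)) + 0.05  # fake data
print(massage(delays, raw_avg)[:5])
```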
Your field is so different from mine.
12:00
Yeah, yeah. Because you can't hold people over multiple days. You can't make them sit in your room for multiple days, and you can't expect them to behave the same way if you make them come back the next day. That, I think, is really difficult. So, how do you know? You mentioned you don't usually keep collecting data until the signal has plateaued. So you observe the data, usually, while you're doing the experiment, and then you kind of look at the time course of...
Yeah, yeah. So, we have a way to look at the data per scan, right? For us, when I say scan, it means, let's say, you decide you want to take time-dependent data over a range from minus 500 to plus 500 femtoseconds, in 10-femtosecond increments. Each time a scan finishes, from minus 500 to plus 500, a plot shows up. And when the second round runs, it shows me its individual plot, its individual scan quality, but also the average of the two. So I can keep track of the signal, and I can see it getting better and better as the scans continue, right? Unless the sample runs out or the laser condition changes drastically or something. Then I would know; I can catch that in the scan as well.
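A toy version of that per-scan bookkeeping, as described: overlay each individual scan as it comes in, then draw the running average over all scans so far. The data is simulated and the plotting is generic, not the actual acquisition software.

```python
# Toy per-scan monitoring: overlay each individual (simulated) scan, then
# draw the running average over all scans so far on top.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
delays = np.arange(-500, 501, 10)                          # delay axis, fs
true = np.where(delays < 0, 0.0, np.exp(-delays / 200.0))  # hypothetical signal

running_sum = np.zeros_like(true)
for n_scan in range(1, 6):                                 # five scans
    scan = true + rng.normal(0, 0.3, size=delays.size)     # one noisy scan
    running_sum += scan
    running_avg = running_sum / n_scan
    plt.plot(delays, scan, alpha=0.3, label=f"scan {n_scan}")
plt.plot(delays, running_avg, "k", lw=2, label="running average")
plt.xlabel("pump-probe delay (fs)")
plt.ylabel("signal (arb. units)")
plt.legend()
plt.show()
```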
But yeah, there's a way to sort of check it. It's not in vivo, is it? It's... What's the
right word for that? In situ? Oh, you're eating something. That's all right. A girl's gotta eat.
Oh, yeah. Anyway, like...
So it's not like you scan a few times and then, surprise, it's bad afterward... I mean, you do check in, right? Like, every few... Especially in the first few hours, I typically check in every five or so scans to make sure everything is all right, that it looks stable, that it's behaving the way it's supposed to be behaving. I have a few diagnostic tools, right?
15:05
I'm looking at a bunch of things at the same time, but as long as those look stable, the pressure looks stable, the temperature looks stable, then I can let it run overnight and forget about it until the next morning. And as long as the sample hasn't run out or the laser hasn't completely freaked out overnight, I will usually have data from the overnight scan by the morning. Then in the morning I'll come in and check whether I want to add more sample, or, if I think the scan is good enough, move on to a different sample or a different experimental condition, something like that. So that's a nice thing about my experiment.
Like, when it is working, it's a pretty robust instrument. It's just that it's not working, like, 300 days out of 365. So yeah, that's it for the show today. Thanks for listening, and find us at eigodescience on Twitter. That is E-I-G-O-D-E-S-C-I-E-N-C-E. See you next time!